Abstract

In this paper we present a linear programming solution for sign pattern recovery of a sparse
signal from noisy random projections of the signal. We consider two types of noise models: input
noise, where noise enters before the random projection, and output noise, where noise enters
after the random projection. Sign pattern recovery involves the estimation of the sign pattern of
a sparse signal. Our idea is to pretend that no noise exists, solve the noiseless $\ell_1$ problem,
namely, $\min \|\beta\|_1$ s.t. $y = G\beta$, and quantize the resulting solution. We show that the quantized
solution perfectly reconstructs the sign pattern of a sufficiently sparse signal. Specifically, we
show that the sign pattern of an arbitrary k-sparse, n-dimensional signal x can be recovered
with SNR = $\Omega(\log n)$ and measurements scaling as $m = O(k \log n/k)$ for all sparsity levels
k satisfying $0 < k \le \alpha n$, where $\alpha$ is a sufficiently small positive constant. Surprisingly, this
bound matches the optimal Max-Likelihood performance bounds in terms of SNR, required
number of measurements, and admissible sparsity level in an order-wise sense. In contrast to
our results, previous results based on LASSO and Max-Correlation techniques either assume
significantly larger SNR, sublinear sparsity levels or restrictive assumptions on signal sets. Our
proof technique is based on noisy perturbation of the noiseless $\ell_1$ problem, in that we estimate
the maximum admissible noise level before sign pattern recovery fails.

1 Introduction

The problem of recovering a sparse signal from noisy projections arises in many real world sensing
applications [1] [2]. Motivated by these applications we consider the problem of estimating x based on
noisy random projections, which we refer to as the Output Noise Model:

y = Gx + e (1)

Here $x \in \mathbb{R}^n$ is a sparse signal with support size k. We assume that the minimum absolute value of
the non-zero components of the sparse signal x is bounded from below by $x_{\min}$, which we assume
without loss of generality to be equal to one. $G \in \mathbb{R}^{m \times n}$ is a matrix chosen from an IID Gaussian
ensemble with its components $G_{ij} \sim \mathcal{N}(0, 1/m)$. The noise vector e is assumed to be Gaussian with
IID components, with each component $e_j \sim \mathcal{N}(0, 1/\mathrm{SNR})$, and independent of G. Note that the term
SNR viewed in this normalized setting is also the inverse of the noise variance.

Notice that the setup of Eq. [1] parameterizes the problem in terms of two parameters namely, 
SNR and the number of measurements. 

In many cases such as system identification, active sensing, and sensor networks, noise can also 
arise at the input. We refer to such situations as the Input Noise Model. Motivated by these 
instances we also study recovery of x from the following measurements: 

y = Gz = G(x + w) (2)

where G and x are as in Equation (1). We let w be an arbitrary deterministic $\ell_\infty$-bounded perturbation
to a sparse signal x. SNR in this case is the inverse of the square of the $\ell_\infty$ norm of the noise,
i.e., SNR = $\|w\|_\infty^{-2}$. Note that in this setting, x is the sparse approximation to the composite
signal z = x + w. This situation is related to the so-called approximately sparse or compressible
signals (see [3]). The deterministic w readily generalizes to the case when w is a Gaussian random
vector and we also state results for this case.
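As a concrete illustration of the two measurement models, the following minimal sketch draws a sparse signal and forms y under Eq. (1) and Eq. (2). The function names, the uniform draw used for the bounded perturbation w, and the magnitude range of the non-zero entries are illustrative assumptions and are not taken from the paper; the entries of G are drawn as $\mathcal{N}(0, 1/m)$, matching the column-normalized ensemble assumed later in the text.

```python
import numpy as np

def make_sparse_signal(n, k, rng):
    """Return a k-sparse x in R^n whose non-zero entries have magnitude >= 1 (x_min = 1)."""
    x = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    signs = rng.choice([-1.0, 1.0], size=k)
    # Magnitudes in [1, 2) are an arbitrary illustrative choice.
    x[support] = signs * rng.uniform(1.0, 2.0, size=k)
    return x

def gaussian_sensing_matrix(m, n, rng):
    """IID Gaussian matrix with entries N(0, 1/m)."""
    return rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))

def output_noise_model(G, x, snr, rng):
    """y = G x + e with e_j ~ N(0, 1/SNR), independent of G."""
    e = rng.normal(0.0, np.sqrt(1.0 / snr), size=G.shape[0])
    return G @ x + e

def input_noise_model(G, x, snr, rng):
    """y = G (x + w) with ||w||_inf <= 1/sqrt(SNR); w is drawn uniformly as an assumption."""
    eps = 1.0 / np.sqrt(snr)
    w = rng.uniform(-eps, eps, size=x.shape[0])
    return G @ (x + w), w
```

For example, `input_noise_model(G, x, 6 * np.log(n), rng)` produces measurements at the SNR level used in the numerical comparison of Figure 1.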

In Compressive Sensing the goal is to reconstruct the signal, x, with significantly fewer
measurements m = dim(y) than the dimension of the signal n = dim(x), by exploiting signal sparsity.
The noiseless problem (e = 0, w = 0) as well as its noisy counterpart have been the subject of
intense research [H El El [Gj [7] . For the noiseless problem it is well known that if x has fewer than k
non-zero components, it can be perfectly recovered if and only if every sub-matrix of G formed by
choosing 2k arbitrary columns of G has full column rank. The reconstruction of an arbitrary sparse
signal, x, from random projections, Gx, can be stated as a combinatorial optimization problem,
which is known to be NP-hard [7]. In [H El El [6l [H [9] it is shown that for sufficiently small k, the
so-called $\ell_1$ relaxation can lead to exact recovery if the sensing matrix G satisfies additional properties.
For example, Donoho et al. [4] show that $m = \Theta(k \log(n/k))$ measurements are sufficient for the
recovery of any k-sparse vector of length n provided the measurements are exact.

In the noisy case perfect recovery is generally impossible for continuous valued signals and
an estimate $\hat{x}$ that closely approximates x in some distance measure is desired. The distances
commonly considered include the $\ell_2$ distance [HI E] , $\ell_1$ distance [10] and sign-pattern recovery [11]
[12] [13] [14]. Sign-pattern recovery, which is the focus of this paper, deals with exactly recovering
the sign pattern of the components of an arbitrary sparse signal x.

The problem of sign pattern recovery is motivated by many problems such as graph topology
identification [12], where the mean squared error criterion provides an insufficient characterization
of the solution. Two different approaches to the sign pattern recovery problem have been studied
in the literature. One line of research has developed algorithm-independent information theoretic
performance bounds to characterize fundamental limits on SNR, the number of measurements, and
tolerable sparsity level required for exact sign pattern recovery from noisy measurements [15] \W[ [13]
[14]. In Aeron et al. [14] it is shown that for the setup considered in Equation (1) an SNR = $\Omega(\log(n))$
and number of measurements $m = \Omega(k \log(n/k))$ are both necessary and sufficient for sign pattern
recovery of any signal x with sparsity $k \le \alpha n$, where $\alpha$ is a sufficiently small positive number.

This paper adopts an algorithmic approach to sign pattern recovery and is based on our
preliminary work [17]. Our algorithm consists of two steps: (1) In the first step we solve
$\min \|\beta\|_1$ s.t. $y = G\beta$, with the data y generated noisily, i.e., y = Gx + e or y = G(x + w). The
solution to the optimization problem is then quantized. (2) In the second step we solve a least squares
regression to improve our estimates obtained in the first step.
Remarkably, it turns out that our scheme essentially matches, in an order-wise sense, the
algorithm-independent necessary conditions on SNR and the number of measurements required
for sign pattern recovery. In comparison, as we describe in Section 1.2, our results are significantly
stronger than bounds derived for other algorithmic approaches such as LASSO [12] [11] and the
Max-Correlation approach [13]. We also derive corresponding results for the input noise model based
on our linear programming algorithm. While information theoretic results for this model have
been developed [13], to the best of our knowledge, our paper is the first to report corresponding
algorithmic results.

The paper is organized as follows. We first describe the notation used throughout the paper
in Section 1.1. In Section 1.2 we present an overview of related work and also describe the main
contributions of this paper. Section 2 describes the thresholded basis pursuit (TBP) algorithm.
Section 3 develops the main results for sign pattern recovery for the input noise model. The
proof for sign pattern recovery is broken up into several steps in Section 4. First, we establish
sign pattern recovery in the linear sparsity regime. Then in the following section we develop sign
pattern recovery for the general case where both linear and sublinear regimes are considered. In
Section 5 these results are extended to the output noise model. Finally, we present some numerical
results in Section 6.



1.1 Notation 

We use capital letters to denote matrices and usually use small letters to denote signal vectors.
The jth component of a signal e is denoted as $e_j$. A matrix $U \in \mathbb{R}^{p \times q}$ where $p \ge q$ ($p < q$) is said
to be orthonormal if each column (row) has unit norm and its columns (rows) are all orthogonal.

In the noisy sensing model y = Gx + e or y = G(x + w), the sensing matrix G is of size m x n.
Correspondingly we have $x, w \in \mathbb{R}^n$ and $y, e \in \mathbb{R}^m$. We denote the support and sign pattern of x
as:

$I_{supp} = \{j \mid x_j \neq 0\}; \quad I^+_{supp} = \{j : x_j > 0\}; \quad I^-_{supp} = \{j : x_j < 0\}$ (3)

and denote

$x_{\min} = \min_{j \in I_{supp}} |x_j| := 1$ (4)

as the minimum magnitude of x on the support. We assume without loss of generality that $x_{\min} = 1$
and we use $x_{\min}$ or substitute the number one whenever convenient. The elements on the support
are denoted by:

$x_{supp} = (x_j)_{j \in I_{supp}}$ (5)

The sparsity k is the size of the support $|I_{supp}| := \#\{I_{supp}\}$. Whenever we consider a linear sparsity
regime, namely, the sparsity, k, of the signal increases in proportion to the signal dimension, n, we
introduce a parameter $\alpha$ to denote the sparsity ratio k/n. We also introduce a parameter C to denote
the dimension to measurement ratio n/m whenever this ratio is a constant. Specifically, we let

$\alpha = \frac{k}{n}, \quad C = \frac{n}{m}$ (6)

Let $\hat{x}$ be an estimate of x based on y. We denote by:

$\hat{I}_{supp} = \{j \mid \hat{x}_j \neq 0\}; \quad \hat{I}^+_{supp} = \{j : \hat{x}_j > 0\}; \quad \hat{I}^-_{supp} = \{j : \hat{x}_j < 0\}$ (7)

We need the following notations to denote false alarms and misses:
and

~ \^supp\ ~ {\I^upp ^ ^suppl ~^ \-^supp ^ Kuppl) (9) 

We also make a note of some probabilistic statements used in the paper. We use Pr(·) to denote
probabilities of events; E(·) to denote expectations; E(z | v) to denote conditional expectation of z
given v; and $1_V$ to denote the indicator function for the set V. We also often state that a random
variable, z, satisfies

$\|z\| \le \gamma$, w.p. $\ge \delta$

to mean that $\Pr(\|z\| \le \gamma) \ge \delta$.

We adopt the family of Bachmann-Landau notations. Specifically, if we say $f(n) = \Theta(g(n))$ we
mean that there is a positive integer $n_0$ and numbers $\lambda_1, \lambda_2 \in \mathbb{R}^+$ such that for all $n > n_0$ we have
$\lambda_1 |g(n)| \le |f(n)| \le \lambda_2 |g(n)|$. By $f(n) = \Omega(g(n))$ we mean that there is a number $\lambda \in \mathbb{R}^+$ such
that $\lim_{n \to \infty} |f(n)/g(n)| \ge \lambda$. By $f(n) = O(g(n))$ we mean that there is a number $\lambda \in \mathbb{R}^+$ such that
$\lim_{n \to \infty} |f(n)/g(n)| \le \lambda$. Finally, by $f(n) = o(g(n))$ we mean that $\lim_{n \to \infty} |f(n)/g(n)| = 0$.



1.2 Overview of Related Work & Our Main Contributions

The information-theoretic algorithm-independent necessary conditions for support recovery from
noisy random projections for the setup described in Equation (1) have been developed by several
authors [HI [El [HI [13]. Sufficient conditions based on max-likelihood have appeared in [15] [H].
Specifically, the following result appears in Aeron, Saligrama & Zhao [14]:

Theorem 1.1. No algorithm can recover the support for the model given by Equation (1) if SNR =
$o(\log(n))$. Furthermore, if SNR = $O(\log(n))$ and if the number of measurements $m = o(k \log(n/k))$,
then support recovery is impossible. Conversely, if $m = \Omega(k \log(n/k))$ and SNR = $\Omega(\log(n))$ then
the max-likelihood algorithm can exactly recover the support of the signal for the model given by
Equation (1) with high probability for any sparsity $k \le \alpha n$, where $\alpha$ is a sufficiently small positive
number.

The main contribution of this paper is that we can essentially achieve these bounds using
a linear programming algorithm. In recent years, researchers have focused on sign
pattern recovery with convex relaxations such as LASSO [11] [12] as well as the max-correlation based
approach [13]. We will describe these approaches in some detail here.



LASSO: The $\ell_1$-constrained quadratic programming, commonly referred to as LASSO (Least
absolute shrinkage and selection operator), solves the following optimization problem:

$\min_{\beta} \frac{1}{2}\|y - G\beta\|_2^2 + \lambda\|\beta\|_1$

where $\lambda$ is a tuning parameter, which is carefully selected to realize a meaningful solution. The
performance analysis of sign pattern recovery for the output noise model of Equation (1) has been
recently characterized by Candes et al. [11] and Wainwright [12]. To simplify the analysis these
authors seek a bound on the number of measurements and SNR to exactly recover the support.
This means that we need to determine whether there is a suitable choice of $\lambda$, SNR and m such
that the solution $\hat{x}$ of LASSO satisfies:

$\hat{I}^+_{supp} = I^+_{supp}, \quad \hat{I}^-_{supp} = I^-_{supp}$ (10)
Remarkably it turns out that for a suitable choice of $\lambda$, Equation (10) is satisfied. However, this
choice of $\lambda$ turns out to be conservative relative to the achievable bounds of Theorem 1.1. For
instance, [12] requires high SNR for support recovery. In particular the author shows that in the
high SNR limit, the number of measurements $m = \Theta(k \log(n - k))$ is both necessary and sufficient
for accurate sign pattern recovery. Candes et al. [11] give tighter SNR bounds for exact support
recovery. They show for the setup of Equation (1) that if SNR > 200 log(n); the maximum sparsity
is sublinear, i.e., $k = O(n/\log(n))$; and the support of the signal x is uniformly distributed over
all possible choices of support; then with $m = \Omega(k \log(n))$ measurements exact sign
pattern recovery is guaranteed with high probability.

In contrast our analysis does not impose restrictions on the signal set and we admit sparsity
ratios in the linear regime. We establish that an SNR = $\Omega(\log(n))$ and $m = \Theta(k \log(n/k))$
measurements are sufficient for exact sign pattern recovery with high probability. In practice these
differences appear to be even more significant on numerical examples, as illustrated in Figure 1.
The primary reason for this discrepancy can be attributed to the difference in our approaches. Our
approach seeks to assert that the true support set of x is always contained (with high probability)
in the support set of the thresholded LP solution. The residual non-zero components of $\hat{x}$ are
significantly small and can be thresholded out. In other words, we do not attempt to exactly
recover the support. In addition to these advantages we also point out that unlike LASSO we do
not employ any tuning parameter on the quadratic penalty term. Indeed, from the analysis of
[12] [11] it appears that this tuning parameter must be chosen as a function of the noise level, which
our algorithm does not require. We describe this aspect in more detail in Section 6.

Figure 1: Left Figure: TBP vs. LASSO for perfect support recovery with different k for the setup of Eq. (1)
with n = 200, m = 100 and SNR = 6 log(n). Sparsity k is varied up to 60. The success probability
is computed based on 40 trials for each sparsity k. Success is declared if no false alarms or miss detections
exist. Right Figure: n, SNR fixed as in left figure and k = 10; m is increased and the success probability is
computed based on 80 trials.

Max-Correlation Approach: Fletcher et al. [13] present necessary and sufficient conditions
for sparsity pattern recovery based on the maximum correlation estimator. The authors establish that
the maximum correlation estimate is close to the necessary condition for sparsity pattern recovery.
These results are stated under a different setup from that of Section 1.1. In particular, they
introduce a different notion of SNR and a notion of minimum-to-average ratio (MAR). Their
sufficiency bound after appropriate substitutions turns out to be:



m> (8 + 6) 



m 



+ 



log(n — k). 



When this is translated to our setup with $x_{\min} = 1$, it turns out that this inequality will never
hold asymptotically. It implies that this sufficient bound is actually a hybrid bound on both
$x_{\min}$, m and sparsity k. To further illustrate the issue consider the situation when $x_{supp} =
(x_{\min}, 2x_{\min}, \ldots, k x_{\min})$; their bound on the number of measurements says that the number of



measurements must scale as m = Q.{k^ log 



n) 



2 The Thresholded Basis Pursuit Algorithm 

Here we propose a new LP based algorithm, namely Thresholded Basis Pursuit (TBP). The analysis
of the algorithm highlights the fact that the SNR level is an important aspect in addition to the number
of measurements. As described earlier in Section 1.2, our analysis shows that the solution to the LP
has relatively small non-zero components that do not belong to the support set of the true signal x.
In addition we show that the support of the true signal is always recovered with high probability.
There is a difference between the proof technique of [11] [12] and ours. [11] [12] investigate conditions
such that the LASSO solution leads to $\hat{x}_j = 0$ for $j \notin I_{supp}$. In contrast we seek solutions such that
components outside the support set are relatively small. This relaxation helps us in bridging the
order gap between LASSO and the linearly achievable sparsity through the max-likelihood decoder.
The algorithm is composed of two steps:



Step 1: Apply the Basis Pursuit

minimize $\|\beta\|_1$ subject to $y = G\beta$

Step 2: Threshold the solution $\hat{x}$ of Step 1 if and only if it is small, i.e.,

$Q(\hat{x}_j) = 0$ if $|\hat{x}_j| < \frac{1}{2} x_{\min} = \frac{1}{2}$, and $Q(\hat{x}_j) = \hat{x}_j$ otherwise,

where we have assumed without loss of generality that $x_{\min} = 1$. The above algorithm will be
referred to as TBP. We then refine these estimates using a least squares regression (see Section 3.3).
We list below the main steps involved in the analysis of TBP.
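The two steps of TBP can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the $\ell_1$ problem is posed as a standard-form LP by splitting $\beta = u - v$ with $u, v \ge 0$ and handing it to SciPy's `linprog`.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(G, y):
    """Step 1: minimize ||beta||_1 subject to y = G beta, posed as an LP."""
    n = G.shape[1]
    # beta = u - v with u, v >= 0, so ||beta||_1 = sum(u) + sum(v) at the optimum.
    cost = np.ones(2 * n)
    A_eq = np.hstack([G, -G])
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n] - res.x[n:]

def tbp(G, y, x_min=1.0):
    """Step 2: set to zero every component with magnitude below x_min / 2."""
    beta = basis_pursuit(G, y)
    return np.where(np.abs(beta) < x_min / 2, 0.0, beta)
```

The estimated sign pattern is then `np.sign(tbp(G, y))`. Note that no tuning parameter depending on the noise level appears; the only constant is the threshold $x_{\min}/2$.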

(A) Linear Regime for Input Noise Models: It turns out that the input noise model described
in Eq. (2) in the linear regime, namely $k = \Theta(\alpha n)$, is the simplest to analyze. We analyze
this case by considering the small noise limit when exact recovery can be guaranteed using
mean-squared error bounds. It is easy to show (see Theorem 3.1 and Corollary 3.2) that in
the small noise limit the required number of measurements to guarantee exact recovery scales
as $O(k \log(n/k)) = O(n \log(1/\alpha))$. Freezing the number of measurements at this level we
increase the level of noise and characterize the limit when the estimated support set does not
contain the true support set. It turns out (see Theorem 3.4(A) and Theorem 3.5(A)) that
this limit is reached precisely when $\|w\|_\infty = O(1/\sqrt{\log(n)})$. These results can be directly
extended to the case when w is a Gaussian random vector and we describe this situation
subsequently.
(B) General Case for Input Noise Model: We then extend the results (see Theorem 3.4(B)
and Theorem 3.5) obtained for the linear regime to the general sparsity case where $k \le \alpha n$.
The general sparsity case contains both the linear and the sublinear regime and therefore
requires additional constraints on the noise^1. This is because when we only assume an $\ell_\infty$
bound on w and the sparsity of x is sublinear in n the ratio of signal power to noise power can
be vanishingly small, i.e., as small as $O(\log n/n)$. In the linear case this ratio is no smaller
than $O(\log n)$.

(C) Output Noise Model: We then combine linear regime (A) and the general case (B) for
the input noise model and derive results for the output noise model (see Theorem 5.1). The
basic idea here is to convert the output noise, e, into an equivalent input noise, w. The main
complication here is that, in doing so, some correlation between G and the equivalent noise
w is introduced and we describe how this can be handled in our analysis.

3 Input Noise Case 

In this section, for clarity of presentation, we state the main results based on the input noise model
y = G(x + w). Similar results will be extended to the output noise model y = Gx + e in Section 5.

We provide a brief outline of the proof by further describing each of the steps described in the 
previous section. 

1. Weak Support Recovery: Here we show (see Section 3.1) that the solution $\hat{x}$ to the linear
program (Step 1) of the TBP satisfies:

$\|\hat{x} - (x + w)\|_2 \le C_s \frac{\|w\|_1}{\sqrt{k}}$ (11)

The result implies that for sufficiently small w the support of $\hat{x}$ contains the support of x.

2. Support Detection: We now increase the noise level while keeping the number of
measurements frozen at the noiseless level and characterize the level at which the support set (and
the sign pattern) of the LP solution no longer contains the support of x. The probability
of missed detection is described in Section 3.2 in Theorem 3.4. We then apply least squares
regression (see Theorem 3.5) to ensure no false support detections. The proof of Theorem 3.4
appears in Section 4 and is based on the following steps:

• Null Space Characterization: We observe that the LP minimization problem can be
recast as follows (see Section 4.1):

$\min_{v} \|x + w + Av\|_1$

where the random matrix A is in the null space of G (recall components of G are
IID Gaussian distributed) and the rows of A are normalized. Suppose v is the optimal
solution. For the linear sparsity regime, namely, $k = \alpha n$ for $\alpha > 0$, it follows directly
from the $\ell_2$ approximation error (Equation (11)) that

$\|Av\|_2 = \|\hat{x} - (x + w)\|_2 = O(\sqrt{n})$

with high probability for a deterministic w with $\|w\|_\infty = O(1)$. By exploiting the
properties of the normalized matrix A it turns out that $\|v\|_2 = O(\sqrt{n})$.



^1 The pessimistic results for the input noise case are fundamental and have also been observed in Aeron et al. [14].
• Conditional Independence Lemma: Note that if A and v are independent then any
component of the vector Av is O(1). This would result in a manageable perturbation
on the non-zero components of x. However, when A and v are correlated, which is the
case here, some components of Av could be large. In Section 4.2 we show that A and v
are only weakly correlated and so no component of Av can be large on the support set.
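The null space characterization above can be checked numerically: any feasible point of the LP differs from z = x + w by a vector in the null space of G. The sketch below is only an illustration under stated assumptions. It uses SciPy's `null_space`, which returns a basis with orthonormal columns, whereas the analysis normalizes the rows of A; the helper names are hypothetical.

```python
import numpy as np
from scipy.linalg import null_space
from scipy.optimize import linprog

def basis_pursuit(G, y):
    """Solve min ||beta||_1 s.t. y = G beta via the split beta = u - v, u, v >= 0."""
    n = G.shape[1]
    res = linprog(np.ones(2 * n), A_eq=np.hstack([G, -G]), b_eq=y,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n] - res.x[n:]

def null_space_decomposition(G, z):
    """Write the LP solution as x_hat = z + A v with the columns of A spanning null(G)."""
    A = null_space(G)                  # n x (n - m), orthonormal columns (an assumption here)
    x_hat = basis_pursuit(G, G @ z)    # measurements are y = G z
    v = A.T @ (x_hat - z)              # coordinates of the perturbation in the basis A
    residual = np.linalg.norm(z + A @ v - x_hat)
    return x_hat, A, v, residual
```

A residual near zero confirms that the perturbation $\hat{x} - (x + w)$ lies in the null space of G, which is the starting point for bounding $\|Av\|_2$.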

3.1 Weak Support Recovery Based on Mean-Squared Error Bound 

Our main goal in this section is to: (a) Present a squared norm approximation result for the TBP
algorithm; (b) Derive a weak support recovery result based on a squared norm distortion bound.
By weak support recovery we mean that either a large fraction of the support can be recovered with
$\|w\|_\infty = O(1/\sqrt{\log n})$ or the support can be completely recovered with sufficiently small noise, w.
We will need the so-called restricted isometry property introduced in [6].

Definition 3.1. Given a matrix G and any set of column indices T, we use $G_T$ to denote the
m x |T| submatrix of G composed of the corresponding columns in T. We further denote $x_T$ as
the vector whose support is on T. Then we say that a matrix G satisfies the Restricted Isometry
Property (RIP) condition with parameter $\delta_k$ if

$(1 - \delta_k)\|x_T\|_2^2 \le \|G_T x_T\|_2^2 \le (1 + \delta_k)\|x_T\|_2^2 \quad \forall x_T, \ \forall T \text{ s.t. } |T| \le k$ (12)

We can think of $\delta_k$ as a functional mapping that maps positive integers k to the unit interval.
Thus $\delta_{2k}$ refers to the RIP constant when the cardinality of the set T is at most 2k. Note that
this definition only applies to column-normalized matrices.

The definition needs to be modified if the sensing matrix is not normalized. Throughout this
paper, we assume G is a column normalized Gaussian matrix, i.e., each entry $G_{ij}$ is an i.i.d. sample
drawn from $\mathcal{N}(0, 1/m)$. The RIP constants for this situation are described in [18]. For our purposes
it turns out that we need $\delta_{2k} < 1/7$. For the Gaussian matrix assumed in this paper it turns out
that,

$\Pr(\delta_{2k} < 1/7) \ge 1 - \exp(-c_1 m)$, if $m \ge c_2 (2k) \log(n/2k)$ (13)

where $c_1, c_2$ are constants, which are independent of n, m, k.
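Computing $\delta_k$ exactly requires examining every column subset, which is combinatorial. The sketch below instead samples random supports and reports the largest deviation observed, which is only a Monte Carlo lower bound on the true RIP constant; the function name and the sampling scheme are illustrative assumptions, not part of the paper's analysis.

```python
import numpy as np

def empirical_rip_lower_bound(G, s, trials, rng):
    """Largest isometry deviation seen over `trials` random supports T with |T| = s.

    Because only some supports are sampled, the result can underestimate delta_s.
    """
    n = G.shape[1]
    worst = 0.0
    for _ in range(trials):
        T = rng.choice(n, size=s, replace=False)
        # Singular values of G_T bound ||G_T x_T||_2 / ||x_T||_2 from above and below.
        sv = np.linalg.svd(G[:, T], compute_uv=False)
        worst = max(worst, sv[0] ** 2 - 1.0, 1.0 - sv[-1] ** 2)
    return worst
```

For a matrix with entries drawn from $\mathcal{N}(0, 1/m)$, calling this with `s = 2 * k` gives a quick sanity check on whether the requirement $\delta_{2k} < 1/7$ is plausible for a chosen (n, m, k); a value above 1/7 shows that the requirement fails, while a smaller value is not a proof that it holds.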

Theorem 3.1. Suppose the sparsity k is such that $\delta_{2k} < 1/7$. Consider the input noise model
y = G(x + w) with x having sparsity k. Then the optimal solution $\hat{x}$ of the Basis Pursuit (i.e., the
solution to Step 1 of TBP) satisfies the following inequality,

$\|\hat{x} - (x + w)\|_2 \le C_s \frac{\|w\|_1}{\sqrt{k}}$ (14)

where the constant $C_s$ depends only on $\delta_{2k}$. For G drawn from an IID Gaussian ensemble as
in Eq. (1) it follows that the above equation holds with probability greater than $1 - \exp(-c_1 m)$ for
$m \ge c_2 (2k) \log(n/2k)$.

The proof techniques for this theorem are borrowed from [3] where the mean squared error
bound of LASSO is derived. The detailed proof is in the appendix (Section 7).

Remark 3.1. Note that the requirement that $\delta_{2k} < 1/7$ is stronger than the condition for the noiseless
case, $\delta_{2k} < \sqrt{2} - 1$ (see [H]). This is the cost we pay for noisy recovery.

We can translate the above result into characterizing the support error as follows. 
Corollary 3.2. Assume w is $\ell_\infty$ bounded, i.e., $\|w\|_\infty \le \epsilon x_{\min}$ for some $\epsilon < 1/2$ and G is as in
Eq. (1). Then the TBP algorithm ensures

$\max\{N_m, N_f\} \le \frac{n}{\alpha}\left(\frac{2 C_s \epsilon}{1 - 2\epsilon}\right)^2$ w.p. $\ge 1 - e^{-c_1 m}$

where $N_m$ represents the number of miss-detected components and $N_f$ represents the number of
false alarms defined in Equations (8) and (9).

Proof. Without loss of generality we assume $x_{\min} = 1$ as described earlier. We only prove
$N_m \le \frac{n}{\alpha}\left(\frac{2 C_s \epsilon}{1 - 2\epsilon}\right)^2$; the bound on $N_f$ follows from the exact same reasoning. By definition of
$N_m$, the TBP algorithm misses $N_m$ components of x in $I_{supp}$. This implies the $\ell_2$ error
$\|\hat{x} - (x + w)\|_2$ is at least $\sqrt{N_m (1/2 - \epsilon)^2}$, because if component i is missed then
$|\hat{x}_i - (x + w)_i|$ is at least $1/2 - \epsilon$. Then by applying Theorem 3.1, together with
$\|w\|_1 \le n\epsilon$ and $k = \alpha n$, we have,

$\sqrt{N_m}\,(1/2 - \epsilon) \le \|\hat{x} - (x + w)\|_2 \le C_s \frac{\|w\|_1}{\sqrt{k}} \le C_s \frac{n \epsilon}{\sqrt{\alpha n}}$

Solving this inequality gives an upper bound on $N_m$. □

Remark 3.2. The above corollary provides an asymptotic bound on the miss-detection rate
$p_m := N_m / k$. With $k = \alpha n$, Corollary 3.2 implies

$p_m \le \frac{1}{\alpha^2}\left(\frac{2 C_s \epsilon}{1 - 2\epsilon}\right)^2$

We can see that $\epsilon$ controls the miss detection rate. For example, if $\epsilon = O(1/\sqrt{\log n})$ and k scales
with n, then $p_m \le O(1/\log n)$, which vanishes when $n \to \infty$. On the other hand, the condition
$\epsilon = O(1/\sqrt{\log n})$ is just saying SNR = $\Omega(\log n)$. This implies that when SNR = $\Omega(\log n)$, the miss
detection rate vanishes asymptotically via TBP.

Alternatively, perfect support recovery is achievable with sufficiently high SNR. The following
result (Lemma 3.3) will be useful later.

Lemma 3.3. Suppose $\|w\|_\infty \le \epsilon$ and $\epsilon$ is sufficiently small. The k columns of G that correspond
to the correct support are part of the optimal basis with probability $\ge 1 - e^{-c_1 m}$.



Proof. We know from Theorem 3.1 that with probability $\ge 1 - e^{-c_1 m}$ Basis Pursuit ensures
$\|\hat{x} - (x + w)\|_2 \le C_s \|w\|_1 / \sqrt{k}$. Now we choose a sufficiently small $\epsilon$ such that
$\epsilon < \frac{1}{4} x_{\min}$ and $C_s \cdot n\epsilon / \sqrt{\alpha n / 2} < \frac{1}{4} x_{\min}$. Given $\|w\|_\infty \le \epsilon$, Theorem 3.1 implies

$|\hat{x}_i - x_i| \le |\hat{x}_i - (x_i + w_i)| + |w_i| \le \|\hat{x} - (x + w)\|_2 + \epsilon \le C_s \|w\|_1 / \sqrt{k} + \epsilon < \frac{1}{2} x_{\min}$

Since Basis Pursuit is an LP algorithm, the optimal solution must be a vertex of the polytope.
Denote $G_1$ as the optimal basis in G for this optimal solution. Since the sign pattern of x is correctly
recovered with probability $\ge 1 - e^{-c_1 m}$ as shown above, the k columns of G that correspond to
the correct support must be included in $G_1$ with this probability. Otherwise, if the i-th column
($i \in I_{supp}$) is not selected into the optimal basis, then $\hat{x}_i = 0$ but we know the correct $|x_i| \ge x_{\min}$,
which contradicts the above inequality $|\hat{x}_i - x_i| < \frac{1}{2} x_{\min}$. □
3.2 Support Detection

The result in the previous section is asymptotic and does not provide conditions for exact support
recovery. Quite surprisingly, we can prove this stronger result based on Lemma 3.3 and the theory
of duality in linear programming. The precise setup of the input noise model is defined as follows.

Definition 3.2 (Input noise model). Sensing model y = G(x + w) where G is a Gaussian matrix
with i.i.d. entries $\mathcal{N}(0, 1/m)$. We assume w is a deterministic $\ell_\infty$ bounded noise, $\|w\|_\infty \le \epsilon_0$, where $\epsilon_0$
will be specified and is always on the order $O(1/\sqrt{\log n})$. Recall that we scale the signal x such that
$x_{\min} = 1$.

Theorem 3.4. Consider the input noise model in Definition 3.2.

(A) Suppose the support size, k, of x is in the interval $\alpha n/2 \le k \le \alpha n$ where $\alpha$ is a positive real
number satisfying $\delta_{2\alpha n} < 1/7$ (see Eq. (13)). Also assume $\|w\|_\infty \le \epsilon_0$ where




(yn- 



n 



Then the TBP algorithm satisfies: 

Pr(iV„, = 0) > 1 - f ^ ^ + 2.24-("-'") + 2e-=i™ + 26" 
V V vr log n 

where the probability is taken with respect to the Gaussian IID ensemble G described in Equation (1)
and $c_1$ is described in Equation (13).

(B) Assume an $\ell_\infty$ bound on noise w, namely, $\|w\|_\infty \le \epsilon_0 := 1/(5 C_s \sqrt{\log n})$. If we also
impose the additional assumption $\|w\|_1 \le k/\log n$, then for any arbitrary support size, $k \le \alpha n$,
there is a constant C such that for $m \ge C k \log(n/k)$ the TBP algorithm recovers the support with
probability at least 1 — e"'^^"* — C'e"'^^""'^) — e~'^™/^ — ^^jj— where C, C, c, c are constants. 

Proof. See Section 4. □

Remark 3.3. This theorem implies that the number of missed detections is exactly zero w.h.p. but does
not say anything about the number of false alarms. We leave the discussion on false alarms to the
next subsection (see Theorem 3.5).

Remark 3.4. Note that the sublinear sparsity regime is not covered by part (A) of the theorem. The
reason can be attributed to the relative increase in noise level. Note that when we only assume an $\ell_\infty$
bound on w and the sparsity of x is sublinear in n the ratio of signal power to noise power can be
vanishingly small, i.e., as small as $O(\log n/n)$. This does not happen in the linear case, where the ratio
is no smaller than $O(\log n)$. For this reason we need to scale the noise power as well, which is the result of
part (B). For the output noise model the scaling of noise power with the number of measurements
occurs naturally, as will become clear in Section 5, and we do not need this constraint there.



3.3 Eliminating Non-Support Elements 

Theorem 3.4 only ensures no miss-detection in the support. However, the number of false alarms
can also be reduced to zero through a standard regression technique.

To achieve zero false alarms we take thrice the number of measurements required for support
detection. We partition the measurements into two parts. The first m measurements are used to
estimate the support elements using TBP. Since the basic feasible solution^2 of a linear program
can have at most m non-zero entries, the support of x can be identified to within m elements.
We next utilize 2m measurements in a regression problem to estimate the support of x using a
standard least squares algorithm. Our modified algorithm is as follows.



Step 1: We take 3m measurements where m is the number of measurements used in Theorem 3.4.
Next we partition the measurements into two parts, $y_1 \in \mathbb{R}^m$ and $y_2 \in \mathbb{R}^{2m}$, and also partition
the sensing matrix correspondingly:

$G = \begin{bmatrix} G_1 \\ G_2 \end{bmatrix}$

Step 2: We apply the TBP algorithm proposed with respect to the first m measurements $y_1$.
Denote I as the indices of nonzero components from this step. The number of non-zero
components is at most m since the optimal solution to a linear program is a basic feasible
solution.

Step 3: Using the second set of measurements $y_2$, we compute

$\hat{x} = G_{2,I}^{\dagger} y_2 = G_{2,I}^{\dagger} (G_2 (x + w_2))$

where $G_{2,I}$ is the submatrix of $G_2$ that comprises the columns in index set I and
$G_{2,I}^{\dagger} = (G_{2,I}^T G_{2,I})^{-1} G_{2,I}^T$ represents the Moore-Penrose pseudo-inverse of $G_{2,I}$.

Step 4: We next threshold the solution $\hat{x}$ if its magnitude is small, i.e.

$Q(\hat{x}_j) = 0$ if $|\hat{x}_j| < \frac{x_{\min}}{2}$, and $Q(\hat{x}_j) = \hat{x}_j$ otherwise.



Remark 3.5. From Theorem 3.4 all the support components are included in I w.h.p. after Step 2.
Steps 3 and 4 are intended to eliminate the potential false alarms from I.

Remark 3.6. The simulation in Section 6 seems to suggest that this modified algorithm is
unnecessary and TBP by itself is sufficient for both detecting the support and eliminating false alarms.
However, our analysis requires this post-processing.

This modified algorithm (referred to as TBP+OLS) is guaranteed to exactly recover the sign pattern
of signal x w.h.p.
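Steps 1 to 4 of TBP+OLS can be sketched as below. This is a minimal sketch under stated assumptions rather than the authors' code: the function names are hypothetical, the LP is solved with SciPy's `linprog`, and the pseudo-inverse of Step 3 is applied through `numpy.linalg.lstsq` instead of forming $(G_{2,I}^T G_{2,I})^{-1}$ explicitly.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(G, y):
    """Solve min ||beta||_1 s.t. y = G beta via the split beta = u - v, u, v >= 0."""
    n = G.shape[1]
    res = linprog(np.ones(2 * n), A_eq=np.hstack([G, -G]), b_eq=y,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n] - res.x[n:]

def tbp_ols(G, y, m, x_min=1.0):
    """G has 3m rows: the first m feed TBP, the remaining 2m feed least squares."""
    # Step 1: partition the measurements and the sensing matrix.
    G1, G2 = G[:m], G[m:]
    y1, y2 = y[:m], y[m:]
    # Step 2: TBP on the first block; I holds the indices surviving the threshold.
    beta = basis_pursuit(G1, y1)
    I = np.flatnonzero(np.abs(beta) >= x_min / 2)
    x_hat = np.zeros(G.shape[1])
    if I.size > 0:
        # Step 3: least squares on the second block restricted to the columns in I.
        coef, *_ = np.linalg.lstsq(G2[:, I], y2, rcond=None)
        # Step 4: threshold the regression estimate once more.
        x_hat[I] = np.where(np.abs(coef) < x_min / 2, 0.0, coef)
    return x_hat
```

Using separate measurement blocks for the two stages keeps the regression in Step 3 independent of the support estimate from Step 2, which mirrors the partition described in Step 1.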

Theorem 3.5. (A) Consider the setup of Theorem 3.4 (A). The TBP+OLS algorithm described
above results in zero false positives and negatives with probability:

Pr{Nm = 0, iV. = 0) > 1 - , - 2.24"("-™) - 26-"'"" - 2e s - e 

y/'Klogn 

(B) Consider the setup of Theorem 3.4(B). The TBP+OLS algorithm described above results in zero
false positives and negatives with probability:



$$\Pr(N_m = 0, N_f = 0) \ge 1 - e^{-c_1 m} - C' e^{-c(n-m)} - e^{-cm/e}$$



Proof. See Appendix (Section [7]). □



¹A basic feasible solution in the simplex method is a solution obtained by setting any n − m variables to zero in a
system of m linear equations in n variables, and solving for the values of the remaining m variables.
4 Proof of Theorem 3.4: Sign Pattern Recovery for Input Noise



The proof can be broken down into three steps. 

(1) For a sufficiently small noise level $w$, we know that $I_{supp} \subset \hat{I}_{supp}$, where $\hat{I}_{supp}$ denotes the
estimated support. As the noise level is increased this situation may continue to hold even when
$\hat{I}_{supp}$ changes. However, we show in Lemma 4.1 that the minimum noise level at which $\hat{I}_{supp}$
changes must result in $I_{supp} \not\subset \hat{I}_{supp}$, namely, we must first lose one or more of the support elements.

(2) We are then reduced to determining the minimum noise level before one or more of the support
elements are lost. It turns out that this regime (noise levels before we lose support) is best
characterized in an equivalent null space setting (see Section 4.1). The null space setting reveals a
structural property of the optimal solution (see Lemma 4.4), namely, that the change in the estimated
solution $\hat{x}$ as the noise level is increased satisfies certain conditional independence properties.

(3) The conditional independence property directly leads to computable bounds on the maximum
perturbation $|\hat{x}_j - x_j|$, $j \in I_{supp}$, as a function of noise level (see Section 4.2) for the linear regime.
The extension to the general case $k < \alpha n$ is then presented in the following section.

We establish the first step by considering unit vectors along a specific (but arbitrary) direction.
To this end, consider a unit vector $w$ (in the $\ell_\infty$ sense) and a scaling parameter $\epsilon$. We have,

$$y = G(x + \epsilon w), \quad \|w\|_\infty = 1. \qquad (15)$$

Note that this is the same model as in Equation [2] except that we have extracted the noise level
into a separate variable $\epsilon$.

Let $G_T$ be the optimal basis associated with the optimal LP (i.e., Step 1 of TBP in Section [2])
solution, where $T$ is the column index set of size $m$. Without loss of generality assume that the true
support is $I_{supp} = \{1, 2, \ldots, k\}$, i.e., the first $k$ components. Lemma 3.3 says that for sufficiently
small $\epsilon > 0$ support detection is guaranteed with high probability, namely, $I_{supp} \subset T$. Fix a value
$\epsilon$ and the vector $w$ for which support detection is guaranteed.

Note that for fixed $w$ the basis $G_T \in \mathbb{R}^{m \times m}$ continues to be optimal for smaller values of
$\epsilon$. Furthermore, as $\epsilon$ increases $G_T$ remains optimal until a column in $T$ violates the optimality
condition. For convenience we denote by:



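$$L = G_T^{-1} G = \begin{bmatrix} L_0 \\ L_1 \end{bmatrix} \qquad (16)$$

where on the RHS we have partitioned $L$ into two submatrices, $L_0 \in \mathbb{R}^{k \times n}$ and $L_1 \in \mathbb{R}^{(m-k) \times n}$.
Note that the optimal solution to Basis Pursuit has $m$ non-zero elements, which is $\hat{x}_T = x_T + \epsilon L w$
and $\hat{x}_{T^c} = 0$ for sufficiently small values of $\epsilon$. The perturbation on the support elements is given
by:

$$\hat{x}_j = x_j + \epsilon (L_0 w)_j, \quad j \in I_{supp}$$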

Denote by $\gamma_0$ the following:

$$\gamma_0 = \min\{\epsilon : x_j + \epsilon (L_0 w)_j = 0 \text{ for some } j \in I_{supp}\} \qquad (17)$$
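
As a small worked illustration of the definition of $\gamma_0$, the sketch below computes the first noise level at which some support entry is driven to zero. The helper name `gamma0` and the numerical values of $x_j$ and $(L_0 w)_j$ are hypothetical; only positive roots of $x_j + \epsilon (L_0 w)_j = 0$ are considered, since $\epsilon$ is a non-negative scaling.

```python
def gamma0(x_support, L0w):
    """Smallest eps > 0 with x_j + eps * (L0 w)_j = 0 for some support index j."""
    candidates = []
    for xj, dj in zip(x_support, L0w):
        # x_j + eps * d_j vanishes at eps = -x_j / d_j; keep positive roots only.
        if dj != 0 and -xj / dj > 0:
            candidates.append(-xj / dj)
    return min(candidates) if candidates else float("inf")

x_support = [1.0, -2.0, 1.5]     # hypothetical support values x_j
L0w = [-0.5, -1.0, 0.25]         # hypothetical values of (L_0 w)_j
g = gamma0(x_support, L0w)

# For any eps below g, every perturbed support entry keeps the sign of x_j.
eps = 0.9 * g
signs_kept = all((xj + eps * dj) * xj > 0 for xj, dj in zip(x_support, L0w))
```

Here only the first entry moves toward zero as $\epsilon$ grows, so it alone determines $\gamma_0$; the other two entries grow in magnitude and never change sign.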

Lemma 4.1. Fix a vector $w$ and a scalar $\epsilon > 0$ such that support detection is guaranteed (which
is guaranteed with high probability by Lemma 3.3). Denote the associated optimal basis by $G_T$. It
follows that $G_T$ remains optimal for all $\epsilon \in [0, \gamma_0]$.

Proof. The proof follows from the primal-dual characterization of optimality. Lemma 3.3 says that for
sufficiently small $\epsilon$, the above $\hat{x}$ ($\hat{x}_T = x_T + \epsilon L w$ and $\hat{x}_{T^c} = 0$) is the optimal solution of Basis
Pursuit (primal problem). We denote by $\pi$ the optimal solution of the following dual problem:

$$\text{maximize } \pi^T y \quad \text{subject to } -1 \le \pi^T G \le 1$$

Duality theory [20] says the optimal primal cost equals the optimal dual cost,

$$\pi^T G(x + \epsilon w) = \pi^T y = \|x_T + \epsilon L w\|_1$$

We know from Lemma 3.3 that the reconstruction error $\epsilon L w$ will not exceed $\frac{1}{2} x_{\min}$ for sufficiently
small $\epsilon$. This implies that for $i \in \{1, 2, \ldots, k\}$, the sign of $x_i + \epsilon (Lw)_i$ is determined by $x_i$. Therefore
we have,

$$\pi^T G(x + \epsilon w) = \|x_T + \epsilon L w\|_1 = \sum_{i=1}^{k} \mathrm{sgn}(x_i)(x + \epsilon L w)_i + \sum_{i=k+1}^{n} |(\epsilon L w)_i|.$$

From complementary slackness, $(\pi^T G)_i = \mathrm{sgn}(x_i)$ for any index $i$ in the support (i.e., $i = 1, 2, \ldots, k$).
Then the above equation can be further simplified to

$$\pi^T G \epsilon w = \sum_{i=1}^{k} \mathrm{sgn}(x_i)(\epsilon L w)_i + \sum_{i=k+1}^{n} |(\epsilon L w)_i|$$

We now consider any positive $\gamma < \gamma_0$. Multiplying by $\gamma/\epsilon$ and adding $\sum_{i=1}^{k} \mathrm{sgn}(x_i) x_i$ to both
sides, we have

$$\pi^T G(x + \gamma w) = \sum_{i=1}^{k} \mathrm{sgn}(x_i)(x + \gamma L w)_i + \sum_{i=k+1}^{n} |(\gamma L w)_i| \qquad (18)$$

By the definition of $\gamma_0$ we know that $(x + \gamma L w)_i$ has the same sign as $x_i$ for $i \in I_{supp}$. Consequently,
the RHS of Equation [18] is exactly $\|x_T + \gamma L w\|_1$ and the whole equation can be rewritten as

$$\pi^T y = \|x_T + \gamma L w\|_1$$

which implies that the primal cost equals the dual cost for the primal-dual pair $(x + \gamma L w, \pi)$.
Therefore, we do not switch the optimal basis when the noise is scaled up to $\gamma$. Now since $\gamma < \gamma_0$ can
be arbitrary, the result holds for the limiting value $\gamma_0$ as well. □

The remaining task reduces to determining the gain of the operator $L_0$. To this end we pass
to a null space characterization.

4.1 Null Space Characterization 

We first quote a classical result for Grassmannian manifolds (see Theorem 2.2 of [21] for more details).

Lemma 4.2. There is a unique distribution on m-dimensional subspaces of $\mathbb{R}^n$ that is invariant
under orthogonal transformations. A subspace from this distribution can be generated as:

1. The range of a random orthonormal $n \times m$ matrix with the orthogonal invariant (OI) distribution;

2. The orthogonal complement of the range of a random orthonormal $n \times (n - m)$ matrix with
the OI distribution;

3. The range of a standard Gaussian random $n \times m$ matrix; or

4. The null space of a standard Gaussian random $(n - m) \times n$ matrix.

Lemma 4.2 provides the tool for converting the original problem formulation into a null space
characterization.

Lemma 4.3. Suppose $y = G(x + \epsilon w)$. There is a one-to-one correspondence between the constrained
optimization problem

$$\min_{\beta} \|\beta\|_1 \quad \text{s.t. } y = G\beta \qquad (19)$$

and the unconstrained optimization problem

$$\min_{v} \|x + \epsilon w + Av\|_1 \qquad (20)$$

such that the optimal solution $\hat{x}$ of Equation [19] and the optimal solution $\hat{v}$ of Equation [20] satisfy
$\hat{x} = x + \epsilon w + A\hat{v}$. Moreover, the entries of $A \in \mathbb{R}^{n \times (n-m)}$ can be regarded as i.i.d. Gaussian samples
from $\mathcal{N}(0, \frac{1}{n-m})$ if the entries of $G$ are i.i.d. Gaussian samples from $\mathcal{N}(0, \frac{1}{m})$.

Proof. Choose $A$ so that its columns span the $(n - m)$-dimensional null space of $G$. Note that it follows
from applying parts (3) and (4) of Lemma 4.2 that we can realize $A$ as an i.i.d. Gaussian matrix with the
specified properties (also see [21]). Then any $\beta$ satisfying $y = G\beta$ can be written as $\beta = x + \epsilon w + Av$, where
$v$ is an $(n - m)$-dimensional vector. This implies the original LP algorithm

$$\min_{\beta} \|\beta\|_1 \quad \text{s.t. } y = G\beta$$

can be converted into the following equivalent unconstrained optimization problem:

$$\min_{v} \|x + \epsilon w + Av\|_1 \qquad (21)$$

More importantly, Lemma 4.2 implies that the entries of $A$ can be characterized as i.i.d. Gaussian
random variables. Finally, we note that the global normalization factor $\frac{1}{m}$ or $\frac{1}{n-m}$ on the Gaussian
distribution will not influence the result of Lemma 4.2. □
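
The reparametrization used in the proof can be checked numerically. The sketch below is only an illustration under assumed toy dimensions: it builds an orthonormal basis of the null space of $G$ from the SVD (rather than the i.i.d. Gaussian realization of $A$ used in the lemma, which spans the same subspace) and verifies that every vector of the form $x + \epsilon w + Av$ satisfies the constraint $y = G\beta$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 20                                   # toy sizes (assumed)
G = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))
x = np.zeros(n)
x[:2] = [1.0, -1.0]
w = rng.uniform(-1.0, 1.0, size=n)             # noise direction with ||w||_inf <= 1
eps = 0.05
y = G @ (x + eps * w)

# Orthonormal basis of null(G): the last n - m right singular vectors of G.
_, _, Vt = np.linalg.svd(G)                    # Vt is n x n (full_matrices=True by default)
A = Vt[m:].T                                   # n x (n - m), so that G @ A = 0

v = rng.normal(size=n - m)                     # arbitrary coordinates in the null space
beta = x + eps * w + A @ v
feasible = np.allclose(G @ beta, y)            # beta satisfies the LP constraint
```

Because $GA = 0$, minimizing $\|\beta\|_1$ over the feasible set is the same as minimizing $\|x + \epsilon w + Av\|_1$ over $v \in \mathbb{R}^{n-m}$.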

For convenience we denote by $\hat{v}$ the optimal solution of the null space problem:

$$\hat{v} = \arg\min_{v} \|x + \epsilon w + Av\|_1 \qquad (22)$$

Based on the above characterization the value $\gamma_0$ of Equation [17] can be equivalently cast in the
null space. First we note that

$$\epsilon (L_0 w)_i = \epsilon w_i + \sum_{k=1}^{n-m} A_{ik} \hat{v}_k, \quad i \in I_{supp},$$

because both of them represent the reconstruction error in the support. Based on the equality we
have

$$\|\epsilon L_0 w\|_\infty \le \epsilon + \max_{i \in I_{supp}} \left| \sum_{k=1}^{n-m} A_{ik} \hat{v}_k \right| \qquad (23)$$

This implies that we are left to understand how $|\sum_{k=1}^{n-m} A_{ik} \hat{v}_k|$ scales with increasing $\epsilon$. Our
main result of this section characterizes a structural property of the optimal solution. It establishes
weak dependence between the optimal solution and the elements of the support set.
Lemma 4.4. Assume $\hat{v}$ is the optimal solution to Equation [22] when $\epsilon < \gamma_0$, where $\gamma_0$ is defined
in Equation [17]. Denote $F = \sum_{i=1}^{k} \mathrm{sgn}(x_i) A_i$ where $A_i$ represents the ith row of $A$. Then if the
RIP condition is satisfied (i.e., $k < \alpha n$ and $\delta_{2\alpha n} < \frac{1}{7}$), the optimal solution $\hat{v}$ is only determined
by $A_{k+1}, A_{k+2}, \ldots, A_n$ and $F$.

Proof. We know from Lemma 4.1 and Lemma 3.3 that if the RIP condition is satisfied the optimal
$\hat{v}$ recovers the sign pattern of $x$ when $\epsilon < \gamma_0$ in Equation [22]. This implies the sign of $x_i +
\epsilon w_i + A_i \hat{v}$ ($i = 1, \ldots, n$) is $\mathrm{sgn}(x_i)$. Now consider a small neighborhood $N(\hat{v})$ of $\hat{v}$ such that
the sign of $x_i + \epsilon w_i + A_i v$ ($i = 1, \ldots, n$, $v \in N(\hat{v})$) does not change in this neighborhood. Then
$\min_{v \in N(\hat{v})} \|x + \epsilon w + Av\|_1$ is equivalent to

$$\min_{v \in N(\hat{v})} \sum_{i=1}^{k} \mathrm{sgn}(x_i)(x_i + \epsilon w_i + A_i v) + \sum_{i=k+1}^{n} |(\epsilon w + Av)_i| \qquad (24)$$

For linear optimization, a local optimum is also the global optimum. Therefore, by neglecting the
constant term $\sum_{i=1}^{k} \mathrm{sgn}(x_i)(x_i + \epsilon w_i)$, $\hat{v}$ is also the optimal solution to

$$\min_{v} \left\{ \sum_{i=1}^{k} \mathrm{sgn}(x_i)(A_i v) + \sum_{i=k+1}^{n} |(\epsilon w + Av)_i| \right\} = \min_{v} \left\{ F v + \sum_{i=k+1}^{n} |(\epsilon w + Av)_i| \right\} \qquad (25)$$

where $F = \sum_{i=1}^{k} \mathrm{sgn}(x_i) A_i$ is defined in the assumption of the lemma. This implies that the
optimal solution $\hat{v}$ depends on $A_1, A_2, \ldots, A_k$ only through their sum $F$. In other words, $\hat{v}$ is only
a function of $A_{k+1}, A_{k+2}, \ldots, A_n$ and $F$ as long as the RIP condition $\delta_{2\alpha n} < \frac{1}{7}$ is satisfied. □

Remark 4.1. The above result implies that there is only a weak dependence between the optimal
solution $\hat{v}$ and the rows of $A$ corresponding to the support elements.

4.2 Linear Sparsity Case: Proof of Theorem 3.4(A)

In this section we only deal with the case $\frac{1}{2}\alpha n < k < \alpha n$, based on the result of Lemma 4.9. Our
task is to determine the maximum tolerable $\epsilon$ or, alternatively, compute $\gamma_0$ (see Eq. [17]). Our main
result in this section is as follows:

Theorem 4.5. Consider the linear sparsity case described above. Then $\gamma_0 = \epsilon_0$ where $\epsilon_0$ is specified
in Theorem 3.4(A), which is on the order of $O(1/\sqrt{\log n})$.

We establish the result through a sequence of steps. First, we need the following standard result 
on singular values for Gaussian matrices. 

Lemma 4.6. Suppose $A \in \mathbb{R}^{n \times (n-m)}$ is a Gaussian matrix with i.i.d. entries drawn from $\mathcal{N}(0, \frac{1}{n-m})$.
Suppose $n = Cm$ where the constant $C$ satisfies $1 < C < \infty$, then we have

$$\|Av\|_2 \ge \frac{1}{2}\left(\sqrt{C/(C-1)} - 1\right)\|v\|_2, \quad \text{for all } v \in \mathbb{R}^{n-m},$$

with probability $\ge 1 - e^{-(\sqrt{n} - \sqrt{n-m})^2/8}$.

Proof. This lemma is a direct corollary of a result in [22]. In [22], it is proved that $\sigma_{\min}$ has the
concentration property:

$$\Pr\left(\sigma_{\min} \ge \sqrt{\frac{n}{n-m}} - 1 - \frac{t}{\sqrt{n-m}}\right) \ge 1 - e^{-t^2/2}$$

We set $t = \frac{1}{2}(\sqrt{n} - \sqrt{n-m})$ and the lemma follows. □
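
The bound of Lemma 4.6 can be sanity-checked on a single random draw. The sketch below is illustrative only; the dimensions ($n = 400$, $m = 100$, so $C = 4$) and the random seed are assumptions. It compares the smallest singular value of a Gaussian $A$ with the lower bound $\frac{1}{2}(\sqrt{C/(C-1)} - 1)$ and evaluates the probability guarantee stated in the lemma.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 400, 100                       # toy sizes (assumed), C = n / m = 4
C = n / m

# A has i.i.d. N(0, 1/(n - m)) entries, as in Lemma 4.6.
A = rng.normal(scale=1.0 / np.sqrt(n - m), size=(n, n - m))
sigma_min = np.linalg.svd(A, compute_uv=False).min()

# Lower bound on the smallest singular value and the probability guarantee.
lower = 0.5 * (np.sqrt(C / (C - 1.0)) - 1.0)
prob_bound = 1.0 - np.exp(-(np.sqrt(n) - np.sqrt(n - m)) ** 2 / 8.0)
```

Since $\|Av\|_2 \ge \sigma_{\min}(A)\|v\|_2$ for every $v$, checking `sigma_min >= lower` is equivalent to checking the inequality of the lemma for all $v$ simultaneously.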

Definition 4.1. Assume $\hat{v}$ is the optimal solution of Equation [22]. We denote by $\mathcal{A}_0$ the intersection
of two sets:

$$\mathcal{A}_0 = \left\{ A : \|A\hat{v}\|_2 \le C_s \cdot \frac{\epsilon \|w\|_1}{\sqrt{k}} \right\} \cap \left\{ A : \|Av\|_2 \ge \frac{1}{2}\left(\sqrt{C/(C-1)} - 1\right)\|v\|_2, \ \forall v \in \mathbb{R}^{n-m} \right\}$$

Intuitively speaking, $\mathcal{A}_0$ contains all those well-behaved matrices $A$ such that Basis Pursuit
results in a good solution in the $\ell_2$ sense and the smallest singular value is lower bounded. We have the
following property for $\mathcal{A}_0$.

Lemma 4.7. Assume $\epsilon < \gamma_0$ in Equation [22]. Then we have

1. If $A \in \mathcal{A}_0$, then $\hat{v}$ is only determined by $\{F, A_{k+1}, \ldots, A_n\}$ where

$$F = \sum_{i=1}^{k} \mathrm{sgn}(x_i) A_i \qquad (26)$$

In other words $\hat{v}$ is conditionally independent of $A_i$ for $1 \le i \le k$ when conditioned on $F$.

2. The measure of $\mathcal{A}_0$ satisfies $\Pr(A \in \mathcal{A}_0) \ge 1 - e^{-c_1 m} - e^{-(\sqrt{n} - \sqrt{n-m})^2/8}$.

Proof. The first part of the lemma follows directly from Lemma 4.4. For the second part, we
note that if $G$ satisfies the RIP condition and we convert the problem to the null space characterization
$\min_v \|x + \epsilon w + Av\|_1$, the solution $\hat{v}$ will satisfy $\|A\hat{v}\|_2 \le C_s \cdot \frac{\epsilon \|w\|_1}{\sqrt{k}}$ via Theorem 3.1. This implies

that $\Pr\{A : \|A\hat{v}\|_2 \le C_s \cdot \frac{\epsilon \|w\|_1}{\sqrt{k}}\} \ge 1 - e^{-c_1 m}$.

Finally, we know from the concentration inequality in Lemma 4.6 that

$$\Pr\left\{A : \|Av\|_2 \ge \frac{1}{2}\left(\sqrt{C/(C-1)} - 1\right)\|v\|_2, \ \forall v \in \mathbb{R}^{n-m}\right\} \ge 1 - e^{-(\sqrt{n} - \sqrt{n-m})^2/8}$$

and hence the second part of the lemma follows. □

We also need the tail probability of $\|F\|_2$.

Lemma 4.8. Suppose $F = \sum_{i=1}^{k} \mathrm{sgn}(x_i) A_i$, then we have,

$$\Pr\left(\|F\|_2 \le \sqrt{2k}\right) \ge 1 - 2.24^{-(n-m)}$$

Proof. We note that $F_j = \sum_{i=1}^{k} \mathrm{sgn}(x_i) A_{ij}$. Then we can rewrite $\frac{n-m}{k}\|F\|_2^2$ as

$$\sum_{j=1}^{n-m} \left( \sum_{i=1}^{k} \sqrt{\frac{n-m}{k}}\, \mathrm{sgn}(x_i) A_{ij} \right)^2$$

Since each $A_{ij}$ is i.i.d. Gaussian $\mathcal{N}(0, \frac{1}{n-m})$ from Lemma 4.2, $\sum_{i=1}^{k} \sqrt{\frac{n-m}{k}}\, \mathrm{sgn}(x_i) A_{ij}$ is an i.i.d.
standard Gaussian random variable. Therefore $\frac{n-m}{k}\|F\|_2^2$ is $\chi^2$ distributed with $(n - m)$ degrees
of freedom. From the tail probability of the $\chi^2$ distribution, we have

$$\Pr\left(\frac{n-m}{k}\|F\|_2^2 \le 2(n - m)\right) \ge 1 - 2.24^{-(n-m)}.$$

□

Lemma 4.9. Suppose, A G CLnd we are in the linear sparsity regime, namely, ^an < k < an 
where a is an absolute constant such that 62an < j ■ Then, 

max lAivl > (di + (i2\/21ogn)e ) < , + 2.24-(''-'") + 2e-^i"' + 2e s" " 

i£{i,-,k} ^ 'J log n 

where $\epsilon$ is the $\ell_\infty$ bound on the input noise $w$; $A_i$ is the ith row of $A$; and $(d_1, d_2)$ are absolute constants
which only depend on $\alpha$ and $C$:



-1 ^ / ; \ -1 



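Proof. It is not easy to directly bound $\Pr(\max_{i \in \{1, \ldots, k\}} |A_i \hat{v}| > (d_1 + d_2\sqrt{2\log n})\epsilon)$ because $A_i$ and
$\hat{v}$ are (weakly) correlated in general. Therefore we introduce an auxiliary variable $v^*$ as

$$v^* = \arg\min_{v} \left\{ F v + \sum_{i=k+1}^{n} |(\epsilon w + Av)_i| \right\} \qquad (27)$$

Now $v^*$ and $A_i$ ($i \in \{1, \ldots, k\}$) are independent given $F = f$. From Lemma 4.4 and Definition
4.1, $\hat{v} = v^*$ if $A \in \mathcal{A}_0$. Moreover, if $A \in \mathcal{A}_0$, the $\ell_2$ norm of $v^*$ can be bounded by applying Lemma
4.6 and Theorem 3.1:

$$\|v^*\|_2 \le 2\left(\sqrt{\frac{C}{C-1}} - 1\right)^{-1} \|Av^*\|_2 \le 2\left(\sqrt{\frac{C}{C-1}} - 1\right)^{-1} C_s \cdot \frac{\epsilon \|w\|_1}{\sqrt{k}} \qquad (28)$$

For simplicity of notation, we denote the resulting upper bound by $C'\sqrt{n}\epsilon$ (using $\|w\|_1 \le n$ and
$k > \frac{1}{2}\alpha n$), where $C' := 2\sqrt{2}\left(\sqrt{\frac{C}{C-1}} - 1\right)^{-1} C_s/\sqrt{\alpha}$.
Then, we can relate $\hat{v}$ and $v^*$ from the law of total probability,

$$\Pr\left(\max_{i \in \{1, \ldots, k\}} |A_i \hat{v}| > (d_1 + d_2\sqrt{2\log n})\epsilon\right)$$

$$= \Pr\left(\max_{i \in \{1, \ldots, k\}} |A_i \hat{v}| > (d_1 + d_2\sqrt{2\log n})\epsilon, \ A \in \mathcal{A}_0\right) + \Pr\left(\max_{i \in \{1, \ldots, k\}} |A_i \hat{v}| > (d_1 + d_2\sqrt{2\log n})\epsilon, \ A \notin \mathcal{A}_0\right)$$

$$\le \Pr\left(\max_{i \in \{1, \ldots, k\}} |A_i v^*| > (d_1 + d_2\sqrt{2\log n})\epsilon\right) + \Pr(A \notin \mathcal{A}_0)$$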
The second term is bounded by $e^{-c_1 m} + e^{-(\sqrt{n} - \sqrt{n-m})^2/8}$ and the remaining task is to bound the first
term. To this end we fix an index $i \in \{1, 2, \ldots, k\}$ and define,

$$\mathcal{V}_1 = \{(A_i, v^*) : |A_i v^*| > (d_1 + d_2\sqrt{2\log n})\epsilon\}, \quad \mathcal{V}_2 = \{(v^*, F) : \|v^*\| \le C'\sqrt{n}\epsilon, \ \|F\| \le \sqrt{2k}\}$$

Let $\mathbb{I}_{\mathcal{V}_i}$, $i = 1, 2$, denote the corresponding indicator functions on these sets. Then, we can write

Pr (\Aiv*\ > (di +(i2\/21ogn)e) =E(Ivi) = ¥.f,v*{E{Iv, \ F,v*)) 

= E^,,.(Iv2E(Ivi I +E^,,*(IvcE(]Ivi | F,v*)) 

< Ep^i,* (Iv2E(Ivi \F)) + 2.24-("-"^) + Pr Aq) (29) 

where the first term in the last inequality follows from the conditional independence result of Lemma 4.7
and the second term in the last inequality follows from Lemma 4.8 and Equation [28]. The first term
in the above expression can be further simplified by noting that $A_i | F$ is a Gaussian random vector
and $A_i v^* | F$ is a Gaussian random variable for any fixed $v^*$. So, we compute the conditional mean
and covariance:

$$E(A_i | F = f) = \mathrm{sgn}(x_i)\frac{f}{k}, \qquad \Lambda_{A_i | F} = \left(1 - \frac{1}{k}\right)\frac{1}{n-m} I \qquad (30)$$

Now for {F,v*) G V2 we have II/II2 < \/2A; and ||f||2 < C'^/ne. This leads to a bound on the 
conditional mean and variance of Gaussian variable Aiv*\p=f. By applying the result in Equation 
[30] the absolute value of its mean is bounded by 

mAi'iF = /)| = i^/n < i||/||.||r||. < lyy (7^- 1) c.^f . 

where the first inequality follows from Cauchy- Schwartz inequality. Recall that C is ratio n/m; Cg 
is described in Equation [TH e is the i^o bound on the input noise, w. 

Now by using the assumption ^an < k < an and ||/||2 < V^k, \E{Aiv*\F = f)\ can be bounded 

by, 

And the variance is bounded through 

Combining the above two bounds on mean and variance we can bound the first term in the final 
expression in Equation 1291 

E{Iv, I iF,v*)) = Pr{\Aiv*\ > die + d2€t\F = f,v* = v) <2Q{t) < ^=6-*'/^ 

when II/II2 < V^k and ||'y||2 < C \pne. Substitute t with \/ 2 log n in the above equation and we 
have 

Vy(\Aiv*\ > (di+d2V21ogn)e|F = /,r =v\ < , when ||/||2 < \/2A?, ||7;||2 < C'^^e. 

V / KVvrlogn 
Finally, by applying a union bound when $\|f\|_2 \le \sqrt{2k}$ and $\|v\|_2 \le C'\sqrt{n}\epsilon$, we get:

$$\Pr\left(\max_{i \in \{1, \ldots, k\}} |A_i v^*| > (d_1 + d_2\sqrt{2\log n})\epsilon \,\Big|\, F = f, v^* = v\right) \le k \Pr\left(|A_i v^*| > (d_1 + d_2\sqrt{2\log n})\epsilon \,\Big|\, F = f, v^* = v\right)$$

And this proves the lemma. □

Remark 4.2. Note that the core of the proof is based on the upper bound for $\|v^*\|_2$ in Equation [28], which
is established by an upper bound for $\|w\|_1$. Therefore, if we relax the assumption $\|w\|_\infty \le \epsilon$ to the
$\ell_1$ assumption $\|w\|_1 \le n\epsilon$, this lemma still holds true. However, as suggested by Equation [23] we
still need a constant £00 bound on w (say ||if ||oo < |) to ensure correct recovery after thresholding. 
This observation will be used in Section [5j 

Part A of Theorem 3.4 now follows by combining Equation [23] and Lemma 4.9 and picking
e = ^ (1 + di + (i2\/logn) . 

4.3 General Sparsity Case: Proof of Theorem 3.4(B)

The bounding techniques in the last section have to be modified for the general sparsity case $k < \alpha n$.
In this section, we describe the extension to the general case. Note that the bounds developed
here for the general sparsity case also provide bounds for the linear regime. However, the SNR
requirements here are slightly more conservative. Indeed, as our statement of Theorem 3.4(B)
suggests, we need additional constraints on the noise $w$ to ensure perfect support detection with
high probability.

There are two main reasons why the linear sparsity result of Lemma 4.9 does not generalize to the
case when $k$ is small:

• Significantly higher effective noise power: Note that when we only assume that $\|w\|_\infty \le 1$ and
the sparsity of $x$ satisfies $k \ll n$, the ratio of signal power to noise power can be vanishingly
small, i.e., as small as $O(k/n)$. For the linear regime this ratio scales as $\Omega(1)$.

• Near singularity of the null space matrix $A$: Note that when $k$ is sublinear with respect to $n$,
the matrix $A$ is nearly square and the result of Lemma 4.6 no longer applies.

Recall the setup of Part (B):

$$y = G(x + w), \quad \|w\|_\infty \le \epsilon, \quad \|w\|_1 \le \frac{k\epsilon}{\sqrt{\log n}} \qquad (31)$$

We point out that in this representation we have absorbed the noise level into the noise $w$, which
is different from the situation considered in Eq. [15] for the linear regime.

The main focus in this section is to establish that for $\epsilon < 1/\sqrt{\log(n)}$ support recovery is ensured.
Our steps mirror those required for the linear regime. Specifically, we can define $\gamma_0$ equivalently as
in Equation [17] adapted to the setting of Eq. [31] above.

Again we appeal to the $\ell_2$ bound of Theorem 3.1. This result holds for all sparsity levels. Given
the above setup, the solution to the problem of Equation [22] satisfies:

$$\|A\hat{v}\|_2 \le C_s \frac{\|w\|_1}{\sqrt{k}} \le C_s \sqrt{k}\,\epsilon/\sqrt{\log n}$$

with high probability.

Before proceeding to find bounds for $\epsilon$ we need to deal with the issue of near-singularity of $A$.
We quote a recent result of Rudelson [23]:

Theorem 4.10 (Rudelson). Let $X$ be an $N \times n$ matrix whose entries are i.i.d. standard Gaussian
$\mathcal{N}(0, 1)$. Denote $\delta = (N - n)/n$ and $\sigma_{\min}$ the smallest singular value of $X$. Then for any $t$ such

that Cn~^l'^ <t<c6, 

Pr ((Tinin <tS■^/n) < C'exp(-cn) + (t/c(5)'^" 

where C, C, c, c are constants. 

Adapting Theorem 4.10 to our context, we have

Lemma 4.11. Let $A$ be an $n \times (n - m)$ matrix whose entries are i.i.d. Gaussian $\mathcal{N}(0, \frac{1}{n-m})$. Then
for any t such that C{n — < t < cm/{n — m), 

Pr (a^in{A) < (^^) ■ 7^ ) < Cexp(-c(n - m)) + exp(-cm/e). (32) 



Proof. Denote X = \Jn — mA and we apply Theorem 14.101 with 5 = m/{n — m) and t 



m c 



n—m e ' 



Pr (Jmin(^) < ( ) J^^— <C'exp(-c(n-m)) + exp(-cm/e). (33) 

\ \n — m J \n — m 2e I 

Note that ^ > 1 and the above equation can be simphfied to 

Pr (a,nin{A) < (^^\ . ^ ) < (7exp(-c(n - m)) + exp(-cm/e). (34) 
\ \n — m J 2e I 

□ 

Parallel to Definition 4.1 and Equation [27], we define the typical set for the general case.

Definition 4.2. Assume $\hat{v}$ is the optimal solution of the optimization problem $\min_v \|x + w + Av\|_1$.
We denote by $\mathcal{A}_1$ the intersection of two sets:

^1 = 1^: \\Avh < C, • ^1 n 1^ : araUA) > (;^)' ^| (35) 

and the auxiliary variable $v^c$ is defined as

$$v^c = \arg\min_{v : \|v\|_2 \le \epsilon n^2} \left\{ F v + \sum_{i=k+1}^{n} |(w + Av)_i| \right\} \qquad (36)$$

where $F = \sum_{i=1}^{k} \mathrm{sgn}(x_i) A_i$.

Remark 4.3. The superscript c in $v^c$ stands for "constrained".

Now we have the following property for $\mathcal{A}_1$ and $v^c$.

Lemma 4.12. Assume $\epsilon < \gamma_0$ in Equation [31]. Then we have

1. If $A \in \mathcal{A}_1$, then $\hat{v} = v^c$;

2. $\Pr(A \in \mathcal{A}_1) \ge 1 - e^{-c_1 m} - C' e^{-c(n-m)} - e^{-cm/e}$;

3. Suppose $m = Ck\log(n/k)$ and $\min\{c_1 C, cC/e\} > 6$, and $A_i$ is the ith row of $A$, then,

$$E[(A_i v^c)^2] \le \frac{\epsilon^2 C_s^2}{\log n}(1 + o(1)), \quad i = 1, \ldots, k. \qquad (37)$$

Proof. We know from Lemma 4.4 that if $A \in \mathcal{A}_1$, $\hat{v}$ is the solution of

$$\min_{v} \left\{ F v + \sum_{i=k+1}^{n} |(w + Av)_i| \right\}.$$

Furthermore, if $A \in \mathcal{A}_1$ it also has a lower-bounded smallest singular value. This implies



V 2 < a;^l{A) Av 2 < eCsJ- < en' 38 

y log n \ m J c 

where the last inequality holds true when $m = \Omega(k \log(n/k))$. Equation [38] implies that $\hat{v}$ is actually
located in the feasible region of minimization problem [36]. Therefore part (1) of the lemma follows.

We know from Lemma [iTTl that Pr{^ ■ \\Av\\2 < Cs ■ ^^^} > 1 - e~''i"\ Combining this result 
with Equation [32l part (2) of the lemma follows. 

Now we want to prove part (3). First, we have the following bound due to Theorem l3.lt 

Define the partition {Bi]i oi Af. Bi=A'ir\{A: \\A\\], < 2n} and Bi = Aln{A:i-n< \\A\\l, < 

[i + l)n} when i >2. Here \\A\\f = I j A- j denotes the Frobenius norm of matrix A. 
When A £ Bi, we have a loose bound 

k 
i=l 

where the second last inequality follows from the definition of Bi and lies in the feasible set 
{v : ||t;||2 < en^}. 

When A G Bi{i = 2, 3, • • • ), we have 

k 

i^(A,^'=)2 < IwAv'Wl < IuWfWvH < l{i + l)n{e7i'f < e'n'i 

i=l 

On the other hand, the probability measures of Bi satisfies, 

Pr(i3i) < Pr(^^) < e-"'"" + Ce-^^""'") + 6"^'"/'= 

/g(i-i)/2\ 



Pr(i3i) < Pr{^ ■.i-n< \\A\\f} < 
where the last inequality follows from the tail probability of distribution (note that ||(n — m)A|||, 



is X distributed with degree n{n — m)). It is easy to check that ^-^/j — — V 3 

'2 \ ""^/^ 
Pr(^i) < ( -i j , Vi > 2 
Then we can bound $E[(A_i v^c)^2]$ as follows.

k 



E 



1=1 

k 



i=l 



Pr(^i) +E 



k 

}-Y^{Aiv'^f\A^Bi 



4 = 1 



Pr(^i) 



i=2 



1 ^ 



i=l 



Pr(^, 



i=2 



-11^ /A 



logn 

When minjciC, cC/e} > 6, it is easy to check that 

lim ( e^"!"' + Ce~^^"^ + e""^™/") logn < lim ( e"6iogn ^ g-6iogn\ ^^g^ ^ y^^^ 2}ogn ^ 



We can also show X]i^2 (|^) logn )• 0: 

/r)\ -71^/4 OO 

1=2 



n I \ —I 
3 



logn = nM^ 5]z--^/^+Mog 



n 



fi=2 



n 



< n^ (2/3)-" logn / x-^'/^+^dx 

74 = 2 

= n^ (2/3)-"'/' 2-"'/4+2(^2/4 _ 2)-i 

< 16?i=^(4/3)-"'/^logn 

□ 

Remark 4.4. It is easy to check numerically that when n is reasonably large (e.g., n > 20), the
o(1) term in Equation [37] is actually smaller than one. Therefore, in the later discussion we assume
n > 20 and

$$E[(A_i v^c)^2] \le \frac{2\epsilon^2 C_s^2}{\log n}, \quad \forall i = 1, \ldots, k. \qquad (39)$$

Lemma 4.13. Assume $\epsilon < \gamma_0$ in Equation [31] so that support detection is guaranteed. We have



rr max \AiV\ > — - 

lie{i,-,fc} (logn)V4 



+ 2eC,(logn)i/4 < e-^^" + Ce-^^"""^) + e-""™/*^ + 



Vlog 



n 



Proof. Denote $S_F = E[(A_i v^c)^2 | F]$. Note that by the symmetry of $A_i$, all $E[(A_i v^c)^2 | F]$'s should take
the same value and therefore $S_F$ does not depend on $i$.

From equation [391 we have Ej7'[5i;'] < t^^- Define the set = {F : Sp > -^^=^}. Then we 
know that J^q has negligible probability measure: 

Pr(Jo) < {Pt{To)E[Sf\F G Jq] + Pr(J5)E[5^|F e T^]) G Tq])"' 



Ef[Sf]{E[Sf\F e J^o]r' < 



Vlog 



n 
On the other hand, conditioned on $F$, $A_i$ ($i = 1, \ldots, k$) and $v^c$ are independent (c.f. Equation
[36]). Therefore, we can regard $A_i v^c |_F$ as a Gaussian random variable and we use $\mu_F$ and $\sigma_F^2$ to
denote its mean and variance. Now we consider the case $F \notin \mathcal{F}_0$.

From the above discussion, we know that in this case + Op < --j^=. This imphes that 

— (log n) V4 — y^logn ' ^^^^ ^^^^ probability of Gaussian distribution, we have 

Pr{\AiV^\ > tiF + (TF- t\F Jo) < 2Q{t) < ^=6''^'/^ 



Substitute t with \/2 log n and substitute [Xp and op with the above bound, we have 
Pr(|^,r| > + 2eC,(log?i)i/^|F J-q) < ^ 



(logn)^/"^ n-v/vFTogn 
Finally we apply the union bound and have 

Pr( max \AiV^\ > + 2eC.(log n)^/"|F ^ J-p) < /, . (40) 

Furthermore, we know from Lemma 4.12 that if $A \in \mathcal{A}_1$, $\hat{v} = v^c$. For simplicity of notation we
denote our objective as

£ :=[ max > ^^^^ + 2eCs{logn)'A 

\te{i,-,k} (logn)V4 



and correspondingly 



£' :={ max \A,v'\ > + 2eC,(log n)^^ 

\ie{i,-,k} (logre)^/* 



Under this notation. Equation [40] can simplified to 

Pr(£:|F G T^) < ^ 



n^/^^^ogn 

Hence we can bound the probability of the unconditioned event £. 

Fr{£) = Pr(£:|FG J-q" and AG^i)Pr(Jo'n^i)+Pr(£:|FG J-Q or ^e^5)Pr(J-oU^5) 
= Pr(£:'=|F G Jo' and A G ^i) Pr(J"^ n ^i) + Pr(^:|F G Jo or ^ G AI) Pr(Jo U Al) 
< Pr(£:"|F G Jo and ^ G ^i) + Pr(Jo U AD 

The second term is already bounded by 1 /\/log n + e"*^!'" + (7e~'^("~'") + e~'^"^/^. And the first term 
can also be bounded, 

Pr(f^|F G Jo and A e Ai) < (Pr(r|F G Jq and A G ^i) Pv{Ai) 

+ Pr(£:^|F G Jo" and A G ^5) Pr(^^))(Pr(^i))-i 
= Pr(f"|FG Jo")(Pr(A))-i 

< — . ^ (l- e-"i™ - C'e-"("-'") - e-^"'/^l < ^ 



n^/^^Togn V / \/log n 

□ 

We are now ready to establish the proof of Theorem 3.4(B). From Lemma 4.13 it follows that
for $\epsilon < \log(n)^{-1/4}/(5C_s)$ the worst-case perturbation is smaller than 1/2 with high probability.
Decomposing as in Equation [23] and applying Lemma 4.1 it follows that support detection is
guaranteed. This is the statement of Theorem 3.4(B).
5 Output Noise Model

In this section, we present the results for the output noise model (Equation [1]) by converting it to
an equivalent input noise model (Equation [2]).

The goal of this section is to prove a result parallel to Theorems 3.4 and 3.5.

Theorem 5.1. Consider the setup of Eq. [1]. We fix the number of measurements to 3m, where
m > C2 log (^) 2k (see Eq. \13[ which arises from the RIP property). We now consider two separate 
cases: 

(A) Linear sparsity, namely, $\alpha n/2 < k < \alpha n$ for some $\alpha > 0$ such that the RIP constant $\delta_{2\alpha n} <
1/7$ is satisfied. For this case we fix the SNR to satisfy $SNR \ge \tau_A \log n$ for some constant $\tau_A$.
Then the TBP+OLS algorithm achieves zero false positives and negatives with high probability,
namely,

Pr(iV„ = 0, iV/ = 0) > 1 - - 2.24"'" - 2.24-^"-'") - 2e-'=i"' - Ae s -. 



Vtt log 



n 



(B) General sparsity ($k < \alpha n$ with $\alpha$ as in (A)). For this case we fix the SNR to satisfy $SNR \ge
\tau_B \log^2 n$ for some constant $\tau_B$. Then the TBP+OLS algorithm achieves zero false positives
and negatives with high probability, namely,



Pr(iV„ = 0, iVf = 0) > 1 - e^'i™ - Ce-'^""'") - e"""'"/'^ - 2.24""^ - -r== - 2e 

^/\og n 



8 



2 



Remark 5.1. The proof of this theorem is based on the link between the input and the output noise
model. We first present the "essential" equivalence between the two models and then point out the
modifications needed in adapting the proof of Theorem 3.4 to this proof.

Remark 5.2. Compared to Theorems 3.4 and 3.5, an extra $\log n$ factor is required for the SNR level
in part (B) of the theorem. This $\log n$ factor arises from the looseness of the general case bounds
for input noise.

To prove the theorem, we consider the following equation for $w$:

$$Gw = e.$$

This is an under-determined equation with infinitely many possible solutions for $w$. Our approach
is to choose the minimum norm solution [23] for $w$, namely, $w = G^T (GG^T)^{-1} e$.
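
The minimum norm solution can be illustrated on a small random instance. The sketch below is a minimal check with assumed toy dimensions and noise scale: it computes $w = G^T(GG^T)^{-1}e$ directly, confirms that it solves $Gw = e$, and compares it with the expression $U\Sigma^{-1}V^T e$ obtained from the thin SVD $G^T = U\Sigma V^T$.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 10, 30                                  # toy sizes (assumed), m < n
G = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))
e = rng.normal(scale=0.1, size=m)              # output noise (assumed scale)

# Minimum-norm solution of the under-determined system G w = e.
w = G.T @ np.linalg.solve(G @ G.T, e)

# Same vector from the thin SVD G^T = U S V^T (U: n x m; S, V: m x m).
U, s, Vt = np.linalg.svd(G.T, full_matrices=False)
w_svd = U @ ((Vt @ e) / s)

# Any other solution w + z with G z = 0 has a larger Euclidean norm.
z = rng.normal(size=n)
z -= G.T @ np.linalg.solve(G @ G.T, G @ z)     # project z onto null(G)
other = w + z
```

The last lines add a null-space component to $w$; the result still solves the system, but has a strictly larger norm because $w$ lies in the range of $G^T$ and is therefore orthogonal to the null space of $G$.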

Next we establish that this solution results in a satisfactory choice. Suppose the singular value
decomposition (SVD) of $G^T$ is $G^T = U\Sigma V^T$ where $U \in \mathbb{R}^{n \times m}$ and $\Sigma, V \in \mathbb{R}^{m \times m}$, then we have

$$w = U\Sigma^{-1}V^T e. \qquad (41)$$

To see this note that $(GG^T)^{-1} = V\Sigma^{-2}V^T$. Then it follows that $G^T(GG^T)^{-1} = U\Sigma^{-1}V^T$. To
express the relation between $w$ and $e$ quantitatively via Equation [41], we need the following two
lemmas.

Lemma 5.2. Suppose $e$ is independent Gaussian noise with distribution $\mathcal{N}(0, \epsilon_i^2)$ and $U \in \mathbb{R}^{n \times m}$
is an orthonormal matrix. Denote $\epsilon = \max_i \epsilon_i$. Then

$$\|Ue\|_\infty \le \epsilon\sqrt{2\log n} \quad \text{with probability} \ge 1 - \frac{1}{\sqrt{\pi \log n}}.$$

Proof. Let $U_i$ be the ith row of $U$. Then we know that $U_i e$ is still a Gaussian variable with zero mean
and variance $\le \epsilon^2$. Hence, from the union bound and the tail probability of the Gaussian distribution,

$$P(\|Ue\|_\infty > t\epsilon) \le \sum_{i=1}^{n} P(|U_i e| > t\epsilon) \le \frac{2n}{\sqrt{2\pi}\, t} e^{-t^2/2} \qquad (42)$$

Taking $t = \sqrt{2\log n}$ in the above inequality, we have,

$$\|Ue\|_\infty \le \epsilon\sqrt{2\log n} \quad \text{with probability} \ge 1 - \frac{1}{\sqrt{\pi \log n}}.$$

□
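
The constant $1/\sqrt{\pi \log n}$ in Lemma 5.2 comes from evaluating the union bound at $t = \sqrt{2\log n}$ together with the standard Gaussian tail estimate $Q(t) \le e^{-t^2/2}/(t\sqrt{2\pi})$. The short sketch below, whose function names are chosen only for illustration, compares the exact value of the union bound $n \cdot 2Q(t)$ with this closed form for a few values of $n$.

```python
import math

def q_function(t):
    """Gaussian tail Q(t) = P(Z > t) for a standard normal Z."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def union_bound(n):
    """Exact value of n * 2Q(t) at t = sqrt(2 log n)."""
    t = math.sqrt(2.0 * math.log(n))
    return n * 2.0 * q_function(t)

def closed_form(n):
    """Relaxation obtained from Q(t) <= exp(-t^2 / 2) / (t * sqrt(2 pi))."""
    return 1.0 / math.sqrt(math.pi * math.log(n))

# Pairs (exact union bound, closed-form relaxation) for several dimensions n.
values = {n: (union_bound(n), closed_form(n)) for n in (10, 100, 1000, 10**6)}
```

Both quantities decay only like $1/\sqrt{\log n}$, which is why the failure probabilities in this section are dominated by the $1/\sqrt{\pi\log n}$ term rather than by the exponentially small terms.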



The next lemma is a classical result on the concentration property of the smallest and largest
singular values of a Gaussian matrix $G$ (see [22] for example).

Lemma 5.3 ([22]). Suppose $G \in \mathbb{R}^{m \times n}$ is a random matrix such that each entry $G_{ij} \sim \mathcal{N}(0, \frac{1}{m})$.
We also assume $(n, m)$ satisfies $n = Cm$ where $C > 1$. Then the smallest singular value $\sigma_{\min}$ and
the largest singular value $\sigma_{\max}$ of $G$ satisfy the following inequality:

$$P\left(\sqrt{C} - 1 - t/\sqrt{m} \le \sigma_{\min} \le \sigma_{\max} \le \sqrt{C} + 1 + t/\sqrt{m}\right) \ge 1 - 2e^{-t^2/2}$$

Combining the above two lemmas, we have the following bounds for $\|w\|_\infty$ and $\|w\|_1$.

Lemma 5.4. Suppose $Gw = e$ where $G_{ij}$ is i.i.d. Gaussian $\mathcal{N}(0, \frac{1}{m})$, $n = Cm$ and $C > 1$. If
$e \sim \mathcal{N}(0, \epsilon^2 I_{m \times m})$, then the minimum norm solution $w = G^T (GG^T)^{-1} e$ satisfies

u u 2e r-. 

Halloo < ^= V^logn 



Vc -I 



■\2 



with probability > 1 , , — e s and 

ll'ii^lli < 2\f7C{^fC -Xy^me 
with probability > 1 — 2.24""* — e s . 

Proof. See Appendix. □ 

For Theorem 15.1( A) in the linear regime we note that e = l/\/r^logn. It follows that, we 
have an equivalent input noise model from Lemma 15.41 with probability > 1 — ^ — 2.24~"^ — 

_U/n-^/mf_ 

2e 8 such that 

2\/2 „ „ _ /r^^. n 



y 



G{x + w), \\w\\oo < r- ' ll^lli ^ 2V2C7-i(\/C- 1)"^ • 



(\/C-l)v^' V^^logn' 



According to the explanation in Remark 4.2, these two assumptions on $w$ can replace the original
conditions in Theorem 3.4(A) and still ensure correct support detection if $\tau_A$ is sufficiently large.

For Theorem l5.1l fB) the variance of the ith component of noise Cj is = ^ ^^^3 ^ for some 
sufficiently large constant tb ■ It follows from Lemma 15.41 that the output noise model y = Gx + e 
is essentially equivalent to following input noise model, 

y = G{x + w), \\w\\oo < - : Iklli < 2V2C{VC- l)-^ ■ 



{VC - l)^/TB\ogn y/wlogn' 
with probability $\ge 1 - \frac{1}{\sqrt{\pi \log n}} - 2.24^{-m} - 2e^{-(\sqrt{n} - \sqrt{m})^2/8}$. They match the conditions in Theorem
3.4(B) if $\tau_B$ is sufficiently large.

Hence we can regard the output noise model as the related input noise model and mimic the steps of the proof in the last section. The main steps remain unchanged except for the perturbation computation in Lemma 4.9. The difficulty is that when we solve $Gw = e$ under the minimum norm criterion and get $w = G^T(GG^T)^{-1}e$, $w$ is weakly correlated with $G$, and hence the bounding techniques developed above might fail to work. This problem can be handled in the following way.

The fundamental step of the proof of Lemma 4.9 is to bound the inner product of $A_i$ and $v$. To accomplish this, we used the fact that $v$ depends on $A_i$ only through $F$. This type of reasoning can also be extended to the model $y = Gx + e$. First, we choose the minimum norm $w = G^T(GG^T)^{-1}e$, so that $y = G(x + w)$. Now $v$ depends on $A_i$ not only through $F$ but also through $w$, because $w = G^T(GG^T)^{-1}e$ might potentially depend on $A_i$. Therefore, in the next step of bounding the inner product of $A_i$ and $v$, we need to condition on both $w$ and $F$ to ensure that $A_i$ and $v$ are conditionally independent of each other. The minimum norm $w$ lies in the range space of $G^T$, which implies $A^T w = 0$. Alternatively, from a QR decomposition $G^T = QR$ we see that $w$ can be represented as $w = QR^{-T}e$. It is well known that $Q$ and $R$ are independent if $G$ is a Gaussian matrix (cf. [25]). Therefore, if we suppose $R^{-T}e$ to be fixed (but unknown), then no information about $Q$ can be deduced from $w$ besides $A^T w = 0$ (i.e., $w \in \mathrm{span}(Q)$). Furthermore, $w$ can be assumed to be uniformly distributed on a sphere for the purpose of analysis. In particular, this implies that the conditional distribution satisfies $p(A_i \mid w) = p(A_i \mid A^T w = 0)$. Next, the conditional distribution of $A_i$ given $A^T w = 0$ is still Gaussian, and the knowledge of $w$ only reveals the average value of the rows of $A$, which is similar to the dependency of $A_i$ through $F$ that we had in Lemma 4.9. Consequently, identical steps can be followed to establish the main result as well. Finally, arguments similar to those used in Section 7.2 show that an OLS step will remove all the false alarms.
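The algebra behind the minimum norm $w$ can be checked numerically. The following is a minimal sketch, assuming that $A$ denotes an orthonormal basis of the null space of $G$ (an assumption, since $A$ is defined outside this section) and using illustrative sizes that are not taken from the paper. It confirms that $G^T(GG^T)^{-1}e$ solves $Gw = e$, that it agrees with the QR form $QR^{-T}e$ for $G^T = QR$, and that $A^T w = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 50                      # illustrative sizes, not from the paper
G = rng.normal(scale=1 / np.sqrt(m), size=(m, n))
e = rng.normal(size=m)

# Minimum-norm solution of G w = e.
w = G.T @ np.linalg.solve(G @ G.T, e)

# Same vector through the thin QR factorization G^T = Q R.
Q, R = np.linalg.qr(G.T)           # Q is n x m, R is m x m upper triangular
w_qr = Q @ np.linalg.solve(R.T, e)

# Assumed meaning of A: orthonormal basis of the null space of G,
# taken here as the last n - m right singular vectors of G.
_, _, Vt = np.linalg.svd(G)        # full SVD, Vt is n x n
A = Vt[m:].T                       # n x (n - m)

residual = np.linalg.norm(G @ w - e)        # should be ~0
qr_gap = np.linalg.norm(w - w_qr)           # should be ~0
null_component = np.linalg.norm(A.T @ w)    # should be ~0
```

All three quantities are zero up to floating-point error, which is the sense in which $w$ carries no information about $Q$ beyond lying in its span.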



6 Numerical Examples 

Our first example illustrates the performance difference between LASSO and basis pursuit (i.e., only Step 1 of TBP). In this example, we choose the signal dimension $n$ to be 200 and set 10% of the components to be nonzero. The sensing matrix $G$ is a 100 x 200 matrix whose elements are i.i.d. Gaussian. Without loss of generality we let the nonzero components be the first $k$ components of the signal. The effective SNR of the system is $6\log n$. The reconstruction result is shown in Figure 2. From this example, we can see that while LASSO does as well as our algorithm in recovering the support, the amplitude values appear to be biased.

We also recall Figure 1 of Section 1.2 for a more systematic comparison between these two approaches. One main difficulty we found in implementing LASSO was determining the optimal tuning parameter $\lambda$. The analysis of [11] suggests that $\lambda = 2\sigma\sqrt{2\log n}$, where $\sigma^2$ is the variance of the i.i.d. additive Gaussian noise, would be a good choice. On the other hand, [26] recommends $\lambda = 2\sigma\sqrt{n}$. Note that in both of these instances we need to know the noise level. In our experiments we also found that support recovery could be improved when $\lambda$ is allowed to depend on the number of measurements $m$ and the sparsity level $k$ as well. However, choosing $\lambda$ in this way is in general very difficult.

For Fig. 1 (left figure) we varied the sparsity level $k$ while keeping $n = 200$, $m = 100$ and SNR $= 6\log(n)$ fixed. We only implemented the first two steps of TBP, i.e., we ignored the OLS step. To implement LASSO we experimented with different values of $\lambda$ and plotted the results for the best parameter we could find. Specifically, for the left figure we optimized the error probability over $\lambda$ for specific sparsity levels via exhaustive search and observed that $\lambda = 0.2$ worked best. We fixed this value of $\lambda$ for all sparsity levels. We see that TBP significantly outperforms LASSO. The success rate of LASSO begins to drop around $k = 10$, whereas the success rate of TBP begins to drop around $k = 30$. The phase transition of TBP happens much later than that of LASSO.

Figure 2: LASSO vs. $\ell_1$ minimization (basis pursuit). The signal length is $n = 200$; 5% of the components are $1$'s, 5% are $-1$'s and the remaining 90% are $0$'s. $G$ is a random 100 x 200 matrix and SNR $= 6\log n$. LASSO is a biased estimator and gives poor reconstruction at the nonzero components.

For the second experiment (right figure in Fig. 1) we fixed $n = 200$, $k = 10$ and varied the number of measurements $m$. Since $m > k$, the plot starts at $m = 20$. Each point on the plot corresponds to an average over 80 Monte Carlo trials. To get a good value for the tuning parameter $\lambda$, we again looked at specific measurement levels and optimized for the success probability. The optimal $\lambda$ turned out to be around 0.3. This value was then held fixed for all values of $m$. Again we see that TBP performs much better than LASSO.

There are possibly two reasons for the poor performance of LASSO. First, we believe thresholding is really necessary even for a shrinkage operator such as LASSO. Second, we might not always be able to choose the optimal $\lambda$. We are unaware of any results on how $\lambda$ should adapt to different $k$ and $m$. As pointed out in the previous discussion, this might be a serious problem in practice.

In the last experiment (Figure 3), we show how the SNR level influences the probability of success of TBP. Here we implement only the first two steps of TBP (basis pursuit and thresholding) and do not use the extra regression step. We fix the signal dimension $n$, sparsity $k$ and number of measurements $m$ and simulate the results for different levels of SNR. Specifically, we fix $(n, m, k) = (200, 100, 20)$ and $x_{\min} = 1$. We let $\sigma^{-1} = (2\sqrt{12\log n} + 2)\cdot\theta$, and vary $\theta$ from $10^{-1}$ to $10^{1}$, which varies the SNR. From our theory, we expect to see the phase transition around SNR $= O(\log n)$, which is what is observed here. Each point on the curve (i.e., each SNR level) is an average of 200 Monte Carlo trials.
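The data for one trial of this experiment could be generated along the following lines. This is only a sketch under stated assumptions: the helper name `make_instance` is hypothetical, the $1/m$ entry variance and the $\pm 1/\sqrt{m}$ Bernoulli scaling are assumed normalizations, and the noise level uses the reading $\sigma^{-1} = (2\sqrt{12\log n} + 2)\theta$ of the relation above. The recovery step (basis pursuit followed by thresholding) is deliberately left out.

```python
import numpy as np

def make_instance(theta, n=200, m=100, k=20, ensemble="gaussian", seed=0):
    """Draw one (G, x, y, sigma) tuple for the output noise model y = G x + e."""
    rng = np.random.default_rng(seed)

    # k-sparse signal with x_min = 1; support placed on the first k entries.
    x = np.zeros(n)
    x[:k] = rng.choice([-1.0, 1.0], size=k)

    # Assumed normalization: entries with variance 1/m in both ensembles.
    if ensemble == "gaussian":
        G = rng.normal(scale=1 / np.sqrt(m), size=(m, n))
    else:
        G = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)

    # Assumed reading of the control-parameter relation.
    sigma = 1.0 / ((2 * np.sqrt(12 * np.log(n)) + 2) * theta)
    y = G @ x + rng.normal(scale=sigma, size=m)
    return G, x, y, sigma

G, x, y, sigma = make_instance(theta=1.0)
```

Sweeping `theta` over a logarithmic grid and averaging a success indicator over repeated seeds would give a curve of the kind shown in Figure 3.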

We can see from Figure 3(a,b) that the success probability curve jumps from zero to one around $\theta = 10^{0}$ for both the Gaussian and Bernoulli ensembles. Note that while our theory is based on the Gaussian ensemble, the results do not appear to be particularly sensitive to the use of a non-Gaussian ensemble. The simulation results also suggest that the OLS step in Section 3.3 (steps 3 and 4) is not really necessary. Also, in our experiments we did not find any qualitative or quantitative difference between the sublinear and linear scenarios. Therefore it is an open problem whether the SNR gap between these two regimes in Theorem 5.1(B) can be improved.

[Figure 3 panels: (a) Gaussian ensembles; (b) Bernoulli ±1 (Rademacher) ensembles. Horizontal axis: control parameter $\theta$ in SNR.]

Figure 3: Success probability of support recovery for TBP as a function of SNR for the output noise model. Here we fix $(n, m, k) = (200, 100, 20)$ and $x_{\min} = 1$. We then let the SNR vary. Specifically, we let $\sigma^{-1} = (2\sqrt{12\log n} + 2)\cdot\theta$, and vary $\theta$ from $10^{-1}$ to $10^{1}$. Each point on the curve is an average of 200 Monte Carlo trials. The phase transition happens around $\theta = 10^{0}$. (a) Gaussian ensembles: each component $G_{ij}$ of the sensing matrix is i.i.d. from $\mathcal{N}(0, 1/m)$. (b) Bernoulli ±1 ensembles: each component $G_{ij}$ is independently chosen to be either $1/\sqrt{m}$ or $-1/\sqrt{m}$ with equal probability.



7 Appendix 

7.1 $\ell_2$ Approximation: Proof of Theorem 3.1

In this section we prove Theorem 3.1 with respect to the sensing model:

$$y = G(x + w)$$

Denote $z := x + w$. We let $T_0$ be the indices of the $|T_0| = k$ largest components of $z$. We further divide the remaining indices into sets $T_1, \ldots, T_J$ of equal size $|T_j| = M$, $j \ge 1$ (where $M$ is a design parameter that will be specified later), in decreasing order of magnitude. We also use $T_{01}$ to denote $T_0 \cup T_1$.

Denote the reconstruction error $h := \hat{x} - z$. We have

$$\|z_{T_0}\|_1 - \|h_{T_0}\|_1 - \|z_{T_0^c}\|_1 + \|h_{T_0^c}\|_1 \le \|z_{T_0} + h_{T_0}\|_1 + \|z_{T_0^c} + h_{T_0^c}\|_1 = \|\hat{x}\|_1 \le \|z\|_1,$$

which can be simplified to

$$\|h_{T_0^c}\|_1 \le \|h_{T_0}\|_1 + 2\|z_{T_0^c}\|_1. \qquad (43)$$



Next, we relate the $\ell_2$ norm of $h_{T_{01}^c}$ to the $\ell_1$ norm of $h_{T_0^c}$. It is obvious that the $k$th largest component of $h_{T_0^c}$ satisfies

$$|h_{T_0^c}(k)| \le \|h_{T_0^c}\|_1 / k.$$

Squaring both sides and then summing from $k = M + 1$ up to $k = n$, we have

$$\|h_{T_{01}^c}\|_2^2 \le \|h_{T_0^c}\|_1^2 \sum_{k=M+1}^{n} \frac{1}{k^2} \le \|h_{T_0^c}\|_1^2 \sum_{k=M+1}^{n} \frac{1}{(k-1)k} \le \frac{\|h_{T_0^c}\|_1^2}{M}. \qquad (44)$$
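The last step in (44) rests on the telescoping identity $\sum_{k=M+1}^{n} \frac{1}{(k-1)k} = \frac{1}{M} - \frac{1}{n}$. A small exact-arithmetic check, with illustrative values of $M$ and $n$ that are not taken from the paper, might look like this:

```python
from fractions import Fraction

def tail_sums(M, n):
    """Return (sum of 1/k^2, sum of 1/((k-1)k)) over k = M+1, ..., n, exactly."""
    squares = sum(Fraction(1, k * k) for k in range(M + 1, n + 1))
    telescoping = sum(Fraction(1, (k - 1) * k) for k in range(M + 1, n + 1))
    return squares, telescoping

M, n = 40, 200                     # illustrative values only
squares, telescoping = tail_sums(M, n)

# 1/((k-1)k) = 1/(k-1) - 1/k, so the second sum collapses to 1/M - 1/n.
closed_form = Fraction(1, M) - Fraction(1, n)
```

Since `squares <= telescoping == closed_form <= 1/M`, the chain of inequalities in (44) holds for these values.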



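Combining inequalities (43) and (44) we have

$$\|h_{T_{01}^c}\|_2 \le \frac{\|h_{T_0}\|_1 + 2\|z_{T_0^c}\|_1}{\sqrt{M}} \le \sqrt{\frac{k}{M}}\left(\|h_{T_0}\|_2 + \frac{2\|z_{T_0^c}\|_1}{\sqrt{k}}\right),$$

where the second inequality follows from the Cauchy-Schwarz inequality.
Hence,

$$\|h\|_2 \le \|h_{T_{01}}\|_2 + \|h_{T_{01}^c}\|_2 \le \|h_{T_{01}}\|_2 + \sqrt{\frac{k}{M}}\,\|h_{T_0}\|_2 + \frac{2\|z_{T_0^c}\|_1}{\sqrt{M}}. \qquad (45)$$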

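From the above inequality, we can see that the remaining task is to upper bound $\|h_{T_{01}}\|_2$. Before deriving this bound, we first derive a bound for $\sum_{j \ge 2} \|h_{T_j}\|_2$ as an intermediate step.

Observe that the magnitude of each component in $T_{j+1}$ is bounded by the average of the magnitudes in $T_j$:

$$|h_{T_{j+1}}(k)| \le \|h_{T_j}\|_1 / M.$$

Then, squaring both sides and summing from $k = jM + 1$ up to $k = (j + 1)M$,

$$\|h_{T_{j+1}}\|_2^2 \le \|h_{T_j}\|_1^2 / M.$$

We take the square root of both sides and sum from $j = 1$ up to the end:

$$\sum_{j \ge 2} \|h_{T_j}\|_2 \le \frac{1}{\sqrt{M}} \sum_{j \ge 1} \|h_{T_j}\|_1 = \frac{\|h_{T_0^c}\|_1}{\sqrt{M}}.$$

Combining with inequality (43) we have

$$\sum_{j \ge 2} \|h_{T_j}\|_2 \le \sqrt{\frac{k}{M}}\left(\|h_{T_0}\|_2 + \frac{2\|z_{T_0^c}\|_1}{\sqrt{k}}\right). \qquad (47)$$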
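The block-wise argument can be illustrated numerically. The sketch below assumes, as is standard for this type of proof, that the blocks $T_1, T_2, \ldots$ partition the entries of $h$ outside $T_0$ in decreasing order of magnitude; the block size and the random vector are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(1)
M, blocks = 8, 6                                   # illustrative sizes

# Magnitudes of h on T_0^c, sorted in decreasing order.
tail = np.sort(np.abs(rng.standard_normal(M * blocks)))[::-1]
T = tail.reshape(blocks, M)                        # row j-1 holds block T_j

l2 = np.linalg.norm(T, axis=1)                     # ||h_{T_j}||_2 per block
l1 = T.sum(axis=1)                                 # ||h_{T_j}||_1 per block

# Each block's l2 norm is controlled by the previous block's l1 norm.
blockwise_ok = bool(np.all(l2[1:] <= l1[:-1] / np.sqrt(M) + 1e-12))

# Summed form: sum_{j>=2} ||h_{T_j}||_2 <= ||h_{T_0^c}||_1 / sqrt(M).
lhs = l2[1:].sum()
rhs = tail.sum() / np.sqrt(M)
```

Both the per-block inequality and its summed form hold for any vector sorted this way, not just for this random draw.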

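Now $\|h_{T_{01}}\|_2$ can be bounded using the RIP property in the following way:

$$0 = \|Gh\|_2 = \Big\|G_{T_{01}}h_{T_{01}} + \sum_{j \ge 2} G_{T_j}h_{T_j}\Big\|_2 \ge \|G_{T_{01}}h_{T_{01}}\|_2 - \sum_{j \ge 2} \|G_{T_j}h_{T_j}\|_2$$

$$\ge \sqrt{1 - \delta_{M+k}}\,\|h_{T_{01}}\|_2 - \sqrt{1 + \delta_M} \sum_{j \ge 2} \|h_{T_j}\|_2$$

$$\ge \sqrt{1 - \delta_{M+k}}\,\|h_{T_{01}}\|_2 - \sqrt{1 + \delta_M}\,\sqrt{\frac{k}{M}}\left(\|h_{T_0}\|_2 + \frac{2\|z_{T_0^c}\|_1}{\sqrt{k}}\right),$$

where the last inequality follows from inequality (47). Since $\|h_{T_0}\|_2 \le \|h_{T_{01}}\|_2$, this implies that

$$\|h_{T_{01}}\|_2 \le \sqrt{1 + \delta_M}\,\frac{2\|z_{T_0^c}\|_1}{\sqrt{M}} \Big/ C_M, \qquad (48)$$

where $C_M = \sqrt{1 - \delta_{M+k}} - \sqrt{1 + \delta_M}\,\sqrt{k/M}$. Finally, combining inequalities (45) and (48), and using $\|z_{T_0^c}\|_1 \le \|w\|_1$ (which holds because $x$ is $k$-sparse and $T_0$ indexes the $k$ largest entries of $z = x + w$), we get

$$\|h\|_2 \le \left(\frac{\sqrt{1 + \delta_M}\,\big(1 + \sqrt{k/M}\big)}{C_M} + 1\right) \frac{2\|w\|_1}{\sqrt{M}}. \qquad (49)$$

We choose $M = 2k$ such that $C_M$ is a positive constant, and this proves the theorem.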

Note that we need to ensure that $C_M$ is positive, i.e., $\sqrt{1 - \delta_{k+2k}} - \sqrt{(1 + \delta_{2k})/2} > 0$, which is equivalent to

$$\frac{1 + \delta_{2k}}{2} + \delta_{3k} < 1. \qquad (50)$$

In [27], the authors prove that for positive integers $c$ and $r$, it follows that $\delta_{cr} \le c \cdot \delta_{2r}$. Applying this inequality to condition (50), we only need to ensure $\delta_{2k} < \frac{1}{7}$.
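To see where the threshold $1/7$ comes from, one can evaluate $C_M$ for $M = 2k$ in the worst case $\delta_{3k} = 3\delta_{2k}$ allowed by the inequality $\delta_{cr} \le c \cdot \delta_{2r}$ (with $c = 3$, $r = k$). The helper `c_m` below is a hypothetical name introduced only for this check:

```python
import math

def c_m(delta_2k):
    """C_M for M = 2k, with delta_3k replaced by its upper bound 3 * delta_2k."""
    delta_3k = 3 * delta_2k
    return math.sqrt(1 - delta_3k) - math.sqrt((1 + delta_2k) / 2)

below = c_m(0.14)        # delta_2k slightly below 1/7
at_threshold = c_m(1 / 7)
above = c_m(0.15)        # delta_2k slightly above 1/7
```

The value is positive just below $1/7$, vanishes at $1/7$ (both square roots equal $\sqrt{4/7}$), and turns negative just above it, so $\delta_{2k} < 1/7$ is exactly what this worst-case substitution requires.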

7.2 Proof of Theorem 3.3

We only prove part (A); part (B) follows along similar reasoning. First note that the number of false positives satisfies $N_f \le m - k$, since any optimal solution to the LP is a basic feasible solution. Suppose $G_{2,I}^{\dagger}$ is the pseudo-inverse of $G_{2,I}$. Then

$$G_{2,I}^{\dagger} y_2 = G_{2,I}^{\dagger}\big(G_2(x + w_2)\big) = x_I + w_{2,I} + G_{2,I}^{\dagger} G_{2,I^c} w_{2,I^c}. \qquad (51)$$

We denote 62 := G2j<='W2j':- Since each element of G2,/c is i.i.d. Gaussian with variance ^, each 
component of 62 is also i.i.d. Gaussian with variance 

n - HI 2.^2./ 1 1 Y ^ 2 



Var(e2) < —e^ < — < , • , = 

^ m ~m ~ \8 + 2di + 2d2\/2Togn 2y/2\ogn) ^ 

The singular value decomposition (SVD) of $G_{2,I}$ gives us $G_{2,I} = U\Sigma V^T$, where $U, V$ are orthonormal matrices and $\Sigma \in \mathbb{R}^{2m \times (k + N_f)}$ is a diagonal matrix. By [23] the pseudo-inverse of $G_{2,I}$ is $G_{2,I}^{\dagger} = V\Sigma^{\dagger}U^T$, where $\Sigma^{\dagger}$ is the pseudo-inverse of $\Sigma$.
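The SVD form of the pseudo-inverse is easy to confirm numerically. The sketch below uses a tall Gaussian matrix as a stand-in for $G_{2,I}$; the sizes and the entry scaling are illustrative assumptions rather than values from the paper. It also checks that applying the pseudo-inverse to noiseless measurements returns the restricted signal, which is the noise-free part of (51).

```python
import numpy as np

rng = np.random.default_rng(2)
rows, cols = 40, 12                 # stand-ins for 2m and k + N_f
G_I = rng.normal(scale=1 / np.sqrt(rows), size=(rows, cols))

# Thin SVD: G_I = U diag(s) V^T with U (rows x cols), V (cols x cols).
U, s, Vt = np.linalg.svd(G_I, full_matrices=False)

# Pseudo-inverse V Sigma^+ U^T; all singular values are nonzero here.
G_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

x_I = rng.standard_normal(cols)
recovered = G_pinv @ (G_I @ x_I)    # noiseless least-squares recovery
```

With full column rank, `G_pinv @ G_I` is the identity, so the only error left in (51) comes from the noise terms.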

Now the reconstruction error can be represented as ^2^-62 = VT,^U'^e2- Since 62 is i.i.d. 

Gaussian as shown above and U are orthonormal matrix, C/^e2 is still i.i.d Gaussian with the same 

2 

distribution as 62- This means {T,^U'^n2)i is independent Gaussian variable with variance < 

The matrix G2J has 2m rows and k + Nf < m columns. The smallest singular value of G2,/ is 
> 4(1 Hence the variance of {T,^U'^e2)i is < ^"^1 ^ By applying Lemma [521 we have 

^2 ^'-~V2> 



|y(StC/^e2)||oo < 2ei(l--^)^V21ogn 



1 3 

< T— ^ < -, W.p. > 1 



(1 - ^)(8 + 2(ii + 2^2^!^) " 8' 

It is clear from the assumption of the theorem that $\|w_{2,I}\|_\infty \le \frac{1}{8}$, and finally we can bound the reconstruction error in equation (51) as

$$\|w_{2,I} + G_{2,I}^{\dagger} G_{2,I^c} w_{2,I^c}\|_\infty \le \|w_{2,I}\|_\infty + \|G_{2,I}^{\dagger} G_{2,I^c} w_{2,I^c}\|_\infty \le \frac{1}{8} + \frac{3}{8} = \frac{1}{2}.$$
7.3 Proof of Lemma 5.4

Suppose $e \sim \mathcal{N}(0, \epsilon^2 I_{m \times m})$. We write the SVD of $G^T$ as $G^T = U\Sigma V^T$, where $U \in \mathbb{R}^{n \times m}$ and $V \in \mathbb{R}^{m \times m}$ are orthonormal matrices and $\Sigma \in \mathbb{R}^{m \times m}$ is a diagonal matrix. Then $w$ can be reformulated as

$$w = U\Sigma^{-1}V^T e. \qquad (52)$$

Since $V$ is orthonormal, $V^T e$ is still Gaussian with the same distribution as $e \sim \mathcal{N}(0, \epsilon^2 I_{m \times m})$.

Conditioned on all $\Sigma_{jj}$'s being lower-bounded by $\frac{1}{2}(\sqrt{C} - 1)$, the variance of $(\Sigma^{-1}V^T e)_i$ is $\le \frac{4\epsilon^2}{(\sqrt{C} - 1)^2}$. By applying Lemma 5.2 and conditioning on $G$, we have

$$\|w\|_\infty \le 2\epsilon(\sqrt{C} - 1)^{-1}\sqrt{2\log n} \quad \text{with probability} \ge 1 - \frac{1}{\sqrt{\pi \log n}}.$$

On the other hand, the concentration property of the smallest singular value in Lemma 5.3 implies that $\Sigma_{jj} \ge \frac{1}{2}(\sqrt{C} - 1)$ with probability $\ge 1 - e^{-(\sqrt{n}-\sqrt{m})^2/8}$. Therefore, by applying Lemma 5.2,

$$\|w\|_\infty \le 2\epsilon(\sqrt{C} - 1)^{-1}\sqrt{2\log n} \quad \text{with probability} \ge 1 - \frac{1}{\sqrt{\pi \log n}} - e^{-(\sqrt{n}-\sqrt{m})^2/8}.$$

To compute the $\ell_1$ bound we proceed as follows: we bound the squared $\ell_2$ norm for a fixed $G$. Note that for a fixed $G$, the noise $w$ is a zero-mean Gaussian random vector as before. We know that all the singular values of $G$ are lower bounded by $(\sqrt{C} - 1)/2$ with probability $\ge 1 - e^{-(\sqrt{n}-\sqrt{m})^2/8}$. The following computations are done for a given $G$ whose smallest singular value is lower bounded by $(\sqrt{C} - 1)/2$.

From the Cauchy-Schwarz inequality, we have $\|w\|_1^2 \le n\|w\|_2^2$. We know from the previous discussion that

$$\|w\|_2^2 = \sum_{i=1}^{m} (\Sigma^{-1}V^T e)_i^2,$$

where the $(\Sigma^{-1}V^T e)_i$ are independent zero-mean Gaussian random variables with variance upper bounded by $\frac{4\epsilon^2}{(\sqrt{C} - 1)^2}$.

Suppose $t$ is a $\chi^2$ random variable with $m$ degrees of freedom. We have

$$\Pr\left(\|w\|_2^2 \le \frac{4\epsilon^2}{(\sqrt{C} - 1)^2} \cdot 2m \,\Big|\, G\right) \ge \Pr(t \le 2m) \ge 1 - 2.24^{-m}, \qquad (53)$$

where the last inequality follows from the tail probability of the $\chi^2$ distribution.
Finally, if we take into account all possible $G$'s, we have

$$\Pr\left(\|w\|_1^2 \le \frac{4\epsilon^2}{(\sqrt{C} - 1)^2} \cdot 2mn\right) \ge 1 - 2.24^{-m} - e^{-(\sqrt{n}-\sqrt{m})^2/8}. \qquad (54)$$

The lemma follows by noting that $n = Cm$.
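A quick numerical sanity check of the two bounds can be sketched as follows. It assumes Gaussian entries of variance $1/m$ (so that the smallest singular value of $G^T$ concentrates near $\sqrt{C} - 1$) and uses illustrative values $m = 100$, $C = 4$, $\epsilon = 0.1$ that are not taken from the paper. It forms $w$ through the SVD expression (52) and compares its norms with the bounds $2\epsilon(\sqrt{C}-1)^{-1}\sqrt{2\log n}$ and $2\sqrt{2C}(\sqrt{C}-1)^{-1}m\epsilon$ obtained above.

```python
import numpy as np

rng = np.random.default_rng(3)
m, C, eps = 100, 4, 0.1            # illustrative values only
n = C * m
G = rng.normal(scale=1 / np.sqrt(m), size=(m, n))   # assumed 1/m variance
e = rng.normal(scale=eps, size=m)

# Thin SVD of G^T = U diag(s) V^T, then w = U Sigma^{-1} V^T e as in (52).
U, s, Vt = np.linalg.svd(G.T, full_matrices=False)
w = U @ ((Vt @ e) / s)

root_c = np.sqrt(C)
sv_floor = (root_c - 1) / 2                              # singular value floor
linf_bound = 2 * eps * np.sqrt(2 * np.log(n)) / (root_c - 1)
l1_bound = 2 * np.sqrt(2 * C) * m * eps / (root_c - 1)

linf_norm = np.abs(w).max()
l1_norm = np.abs(w).sum()
```

For this draw both norms sit well inside the bounds, which is expected since the lemma is a high-probability worst-case statement rather than a tight estimate.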